
Ascential DataStage

for Ascential DataStage Enterprise Edition

Parallel Job Developer's Guide


Version 7.5.1

Part No. 00D-023DS751 December 2004

This document, and the software described or referenced in it, are confidential and proprietary to Ascential Software Corporation ("Ascential"). They are provided under, and are subject to, the terms and conditions of a license agreement between Ascential and the licensee, and may not be transferred, disclosed, or otherwise provided to third parties, unless otherwise permitted by that agreement. No portion of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of Ascential. The specifications and other information contained in this document for some purposes may not be complete, current, or correct, and are subject to change without notice.

NO REPRESENTATION OR OTHER AFFIRMATION OF FACT CONTAINED IN THIS DOCUMENT, INCLUDING WITHOUT LIMITATION STATEMENTS REGARDING CAPACITY, PERFORMANCE, OR SUITABILITY FOR USE OF PRODUCTS OR SOFTWARE DESCRIBED HEREIN, SHALL BE DEEMED TO BE A WARRANTY BY ASCENTIAL FOR ANY PURPOSE OR GIVE RISE TO ANY LIABILITY OF ASCENTIAL WHATSOEVER. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL ASCENTIAL BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

If you are acquiring this software on behalf of the U.S. government, the Government shall have only "Restricted Rights" in the software and related documentation as defined in the Federal Acquisition Regulations (FARs) in Clause 52.227.19 (c) (2). If you are acquiring the software on behalf of the Department of Defense, the software shall be classified as "Commercial Computer Software" and the Government shall have only "Restricted Rights" as defined in Clause 252.227-7013 (c) (1) of DFARs.

This product or the use thereof may be covered by or is licensed under one or more of the following issued patents: US6604110, US5727158, US5909681, US5995980, US6272449, US6289474, US6311265, US6330008, US6347310, US6415286; Australian Patent No. 704678; Canadian Patent No. 2205660; European Patent No. 799450; Japanese Patent No. 11500247.

© 2005 Ascential Software Corporation. All rights reserved. DataStage, EasyLogic, EasyPath, Enterprise Data Quality Management, Iterations, Matchware, Mercator, MetaBroker, Application Integration, Simplified, Ascential, Ascential AuditStage, Ascential DataStage, Ascential ProfileStage, Ascential QualityStage, Ascential Enterprise Integration Suite, Ascential Real-time Integration Services, Ascential MetaStage, and Ascential RTI are trademarks of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions.

The software delivered to Licensee may contain third-party software code. See Legal Notices (legalnotices.pdf) for more information.

How to Use this Guide

This guide describes features of the DataStage Manager and DataStage Designer. It is intended for application developers and system administrators who want to use DataStage to design and develop data warehousing applications using parallel jobs.

If you are new to DataStage, you should read the DataStage Designer Guide and the DataStage Manager Guide. These provide general descriptions of the DataStage Manager and DataStage Designer, and give you enough information to get up and running.

This manual contains more specific information and is intended to be used as a reference guide. It gives detailed information about parallel job design and stage editors. For more advanced information, see the Parallel Job Advanced Developer's Guide.

To find particular topics you can:
- Use the Guide's contents list (at the beginning of the Guide).
- Use the Guide's index (at the end of the Guide).
- Use the Adobe Acrobat Reader bookmarks.
- Use the Adobe Acrobat Reader search facility (select Edit > Search).

The guide contains links both to other topics within the guide, and to other guides in the DataStage manual set. The links are shown in blue. Note that, if you follow a link to another manual, you will jump to that manual and lose your place in this manual. Such links are shown in italics.

Documentation Conventions
This manual uses the following conventions:
Convention: Usage

Bold: In syntax, bold indicates commands, function names, keywords, and options that must be input exactly as shown. In text, bold indicates keys to press, function names, and menu selections.

UPPERCASE: In syntax, uppercase indicates BASIC statements and functions and SQL statements and keywords.

Italic: In syntax, italic indicates information that you supply. In text, italic also indicates UNIX commands and options, file names, and pathnames.

Plain: In text, plain indicates Windows commands and options, file names, and path names.

Lucida Typewriter: The Lucida Typewriter font indicates examples of source code and system output.

Lucida Typewriter bold: In examples, Lucida Typewriter bold indicates characters that the user types or keys the user presses (for example, <Return>).

[ ]: Brackets enclose optional items. Do not type the brackets unless indicated.

{ }: Braces enclose nonoptional items from which you must select at least one. Do not type the braces.

itemA | itemB: A vertical bar separating items indicates that you can choose only one item. Do not type the vertical bar.

...: Three periods indicate that more of the same type of item can optionally follow.

➤: A right arrow between menu commands indicates you should choose each command in sequence. For example, "Choose File ➤ Exit" means you should choose File from the menu bar, then choose Exit from the File pull-down menu.

➥ This line continues: The continuation character is used in source code examples to indicate a line that is too long to fit on the page, but must be entered as a single line on screen.

The following conventions are also used:

- Syntax definitions and examples are indented for ease in reading.
- All punctuation marks included in the syntax (for example, commas, parentheses, or quotation marks) are required unless otherwise indicated.
- Syntax lines that do not fit on one line in this manual are continued on subsequent lines. The continuation lines are indented. When entering syntax, type the entire syntax entry, including the continuation lines, on the same input line.
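For example, a hypothetical command documented with these conventions might appear as:

    SET TRACING {ON | OFF} [filename]...

Here the braces and vertical bar indicate that you must supply either ON or OFF, the brackets indicate that filename is optional, and the three periods indicate that further file names may follow. (SET TRACING is an invented name used only to illustrate the notation; it is not a DataStage command.)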


User Interface Conventions

The following picture of a typical DataStage dialog box illustrates the terminology used in describing user interface elements:

[Figure: a typical DataStage dialog box, with callouts identifying the drop-down list, field, browse button, check box, option button, and button, and showing the Inputs page with its General tab selected.]

The DataStage user interface makes extensive use of tabbed pages, sometimes nesting them to enable you to reach the controls you need from within a single dialog box. At the top level, these are called pages; at the inner level, they are called tabs. In the example above, we are looking at the General tab of the Inputs page. When using context-sensitive online help you will find that each page has a separate help topic, but each tab uses the help topic for the parent page. You can jump to the help pages for the separate tabs from within the online help.

DataStage Documentation
DataStage documentation includes the following:

- DataStage Enterprise Edition: Parallel Job Developer's Guide: This guide describes the tools that are used in building a parallel job, and it supplies programmer's reference information.
- DataStage Enterprise Edition: Parallel Job Advanced Developer's Guide: This guide gives more specialized information about parallel job design.


- DataStage Install and Upgrade Guide: This guide describes how to install DataStage on Windows and UNIX systems, and how to upgrade existing installations.
- DataStage Server: Server Job Developer's Guide: This guide describes the tools that are used in building a server job, and it supplies programmer's reference information.
- DataStage Enterprise MVS Edition: Mainframe Job Developer's Guide: This guide describes the tools that are used in building a mainframe job, and it supplies programmer's reference information.
- DataStage Designer Guide: This guide describes the DataStage Manager and Designer, and gives a general description of how to create, design, and develop a DataStage application.
- DataStage Manager Guide: This guide describes the DataStage Manager and how to view and edit the contents of the Repository.
- DataStage Director Guide: This guide describes the DataStage Director and how to validate, schedule, run, and monitor DataStage server jobs.
- DataStage Administrator Guide: This guide describes DataStage setup, routine housekeeping, and administration.
- DataStage NLS Guide: This guide contains information about using the NLS features that are available in DataStage when NLS is installed.

These guides are also available online in PDF format. You can read them using the Adobe Acrobat Reader supplied with DataStage. You can use the Acrobat search facilities to search the whole DataStage document set. To use this feature, select Edit > Search, then choose the All PDF Documents in option and specify the DataStage docs directory (by default this is C:\Program Files\Ascential\DataStage\Docs).

Extensive online help is also supplied. This is especially useful when you have become familiar with using DataStage and need to look up particular pieces of information.


Contents
How to Use this Guide
Documentation Conventions . . . . iii
User Interface Conventions . . . . v
DataStage Documentation . . . . v

Chapter 1

Introduction
DataStage Parallel Jobs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2

Chapter 2

Designing Parallel Jobs


Parallel Processing . . . . 2-1
  Pipeline Parallelism . . . . 2-2
  Partition Parallelism . . . . 2-3
  Combining Pipeline and Partition Parallelism . . . . 2-4
  Repartitioning Data . . . . 2-4
Parallel Processing Environments . . . . 2-5
The Configuration File . . . . 2-6
Partitioning, Repartitioning, and Collecting Data . . . . 2-7
  Partitioning . . . . 2-7
  Collecting . . . . 2-19
  Repartitioning . . . . 2-22
  The Mechanics of Partitioning and Collecting . . . . 2-23
Sorting Data . . . . 2-25
Data Sets . . . . 2-26


Meta Data . . . . 2-26
  Runtime Column Propagation . . . . 2-27
  Table Definitions . . . . 2-27
  Schema Files and Partial Schemas . . . . 2-28
  Data Types . . . . 2-28
  Strings and Ustrings . . . . 2-31
  Complex Data Types . . . . 2-32
Incorporating Server Job Functionality . . . . 2-33

Chapter 3

Stage Editors
Showing Stage Validation Errors . . . . 3-7
The Stage Page . . . . 3-8
  General Tab . . . . 3-8
  Properties Tab . . . . 3-8
  Advanced Tab . . . . 3-12
  Link Ordering Tab . . . . 3-14
  NLS Map Tab . . . . 3-16
  NLS Locale Tab . . . . 3-17
Inputs Page . . . . 3-18
  General Tab . . . . 3-19
  Properties Tab . . . . 3-19
  Partitioning Tab . . . . 3-20
  Format Tab . . . . 3-25
  Columns Tab . . . . 3-26
  Advanced Tab . . . . 3-45
Outputs Page . . . . 3-47
  General Tab . . . . 3-48
  Properties Tab . . . . 3-48
  Format Tab . . . . 3-49
  Columns Tab . . . . 3-51
  Mapping Tab . . . . 3-52
  Advanced Tab . . . . 3-55


Chapter 4

Data Set Stage


Must Do's . . . . 4-2
  Writing to a Data Set . . . . 4-2
  Reading from a Data Set . . . . 4-2
Stage Page . . . . 4-2
  Advanced Tab . . . . 4-3
Inputs Page . . . . 4-3
  Input Link Properties Tab . . . . 4-4
  Partitioning Tab . . . . 4-5
Outputs Page . . . . 4-7
  Output Link Properties Tab . . . . 4-7
Chapter 5

Sequential File Stage


Example of Writing a Sequential File . . . . 5-3
Example of Reading a Sequential File . . . . 5-4
Must Do's . . . . 5-6
  Writing to a File . . . . 5-6
  Reading from a File . . . . 5-6
Stage Page . . . . 5-7
  Advanced Tab . . . . 5-7
  NLS Map Tab . . . . 5-7
Inputs Page . . . . 5-8
  Input Link Properties Tab . . . . 5-9
  Partitioning Tab . . . . 5-10
  Input Link Format Tab . . . . 5-13
Outputs Page . . . . 5-25
  Output Link Properties Tab . . . . 5-26
  Reject Links . . . . 5-30
  Output Link Format Tab . . . . 5-30
Using RCP With Sequential Stages . . . . 5-42

Chapter 6

File Set Stage


Must Do's . . . . 6-2
  Writing to a File . . . . 6-3
  Reading from a File . . . . 6-3


Stage Page . . . . 6-3
  Advanced Tab . . . . 6-3
  NLS Map Tab . . . . 6-4
Inputs Page . . . . 6-5
  Input Link Properties Tab . . . . 6-5
  Partitioning Tab . . . . 6-8
  Input Link Format Tab . . . . 6-10
Outputs Page . . . . 6-22
  Output Link Properties Tab . . . . 6-23
  Reject Link Properties . . . . 6-25
  Output Link Format Tab . . . . 6-25
Using RCP With File Set Stages . . . . 6-37

Chapter 7

Lookup File Set Stage


Must Do's . . . . 7-3
  Creating a Lookup File Set . . . . 7-3
  Looking Up a Lookup File Set . . . . 7-3
Stage Page . . . . 7-3
  Advanced Tab . . . . 7-4
Inputs Page . . . . 7-5
  Input Link Properties Tab . . . . 7-5
  Partitioning Tab . . . . 7-7
Outputs Page . . . . 7-9
  Output Link Properties Tab . . . . 7-9

Chapter 8

External Source Stage


Must Do's . . . . 8-2
Stage Page . . . . 8-3
  Advanced Tab . . . . 8-3
  NLS Map Tab . . . . 8-3
Outputs Page . . . . 8-4
  Output Link Properties Tab . . . . 8-5
  Reject Link Properties . . . . 8-6
  Format Tab . . . . 8-7
Using RCP With External Source Stages . . . . 8-18


Chapter 9

External Target Stage


Must Do's . . . . 9-2
Stage Page . . . . 9-3
  Advanced Tab . . . . 9-3
  NLS Map Tab . . . . 9-3
Inputs Page . . . . 9-4
  Input Link Properties Tab . . . . 9-4
  Partitioning Tab . . . . 9-6
  Format Tab . . . . 9-8
Outputs Page . . . . 9-20
Using RCP With External Target Stages . . . . 9-21

Chapter 10

Complex Flat File Stage


Must Do's . . . . 10-2
Stage Page . . . . 10-3
  File Options Tab . . . . 10-3
  Record Options Tab . . . . 10-8
  Columns Tab . . . . 10-11
  Layout Tab . . . . 10-16
  NLS Map Tab . . . . 10-17
  Advanced Tab . . . . 10-17
Input Page . . . . 10-18
  Input Link Columns Tab . . . . 10-18
  Partitioning Tab . . . . 10-18
Output Page . . . . 10-21
  Selection Tab . . . . 10-21
  Output Link Columns Tab . . . . 10-26
  Reject Links . . . . 10-27

Chapter 11

SAS Parallel Data Set Stage


Must Do's . . . . 11-2
  Writing an SAS Data Set . . . . 11-2
  Reading an SAS Data Set . . . . 11-2
Stage Page . . . . 11-2
  Advanced Tab . . . . 11-3


Inputs Page . . . . 11-3
  Input Link Properties Tab . . . . 11-4
  Partitioning Tab . . . . 11-5
Outputs Page . . . . 11-7
  Output Link Properties Tab . . . . 11-7

Chapter 12

DB2/UDB Enterprise Stage


Accessing DB2 Databases . . . . 12-3
  Remote Connection . . . . 12-4
  Handling Special Characters (# and $) . . . . 12-5
  Using the Pad Character Property . . . . 12-7
  Type Conversions - Writing to DB2/UDB . . . . 12-8
  Type Conversions - Reading from DB2/UDB . . . . 12-9
Examples . . . . 12-10
  Looking Up a DB2/UDB Table . . . . 12-10
  Updating a DB2/UDB Table . . . . 12-12
Must Do's . . . . 12-14
  Writing a DB2 Database . . . . 12-14
  Updating a DB2 Database . . . . 12-15
  Deleting Rows from a DB2 Database . . . . 12-15
  Loading a DB2 Database . . . . 12-15
  Reading a DB2 Database . . . . 12-16
  Performing a Direct Lookup on a DB2 Database Table . . . . 12-16
  Performing an In Memory Lookup on a DB2 Database Table . . . . 12-17
Stage Page . . . . 12-17
  Advanced Tab . . . . 12-17
  NLS Map Tab . . . . 12-18
Inputs Page . . . . 12-18
  Input Link Properties Tab . . . . 12-19
  Partitioning Tab . . . . 12-36
Outputs Page . . . . 12-38
  Output Link Properties Tab . . . . 12-39


Chapter 13

Oracle Enterprise Stage


Accessing Oracle Databases . . . . 13-3
  Handling Special Characters (# and $) . . . . 13-4
  Loading Tables . . . . 13-5
  Type Conversions - Writing to Oracle . . . . 13-6
  Type Conversions - Reading from Oracle . . . . 13-8
Examples . . . . 13-8
  Looking Up an Oracle Table . . . . 13-8
  Updating an Oracle Table . . . . 13-10
Must Do's . . . . 13-12
  Updating an Oracle Database . . . . 13-12
  Deleting Rows from an Oracle Database . . . . 13-13
  Loading an Oracle Database . . . . 13-13
  Reading an Oracle Database . . . . 13-14
  Performing a Direct Lookup on an Oracle Database Table . . . . 13-14
  Performing an In Memory Lookup on an Oracle Database Table . . . . 13-15
Stage Page . . . . 13-15
  Advanced Tab . . . . 13-15
  NLS Map . . . . 13-16
Inputs Page . . . . 13-17
  Input Link Properties Tab . . . . 13-17
  Partitioning Tab . . . . 13-25
Outputs Page . . . . 13-27
  Output Link Properties Tab . . . . 13-28

Chapter 14

Teradata Enterprise Stage


Accessing Teradata Databases . . . . 14-2
  Installing the Teradata Utilities Foundation . . . . 14-2
  Creating Teradata User . . . . 14-2
  Creating a Database Server . . . . 14-2
Teradata Databases - Points to Note . . . . 14-3
  NLS Support and Teradata Database Character Sets . . . . 14-3
  Column Name and Data Type Conversion . . . . 14-4
  Restrictions and Limitations when Writing to a Teradata Database . . . . 14-6
  Restrictions on Reading a Teradata Database . . . . 14-7


Must Do's . . . . 14-7
  Writing a Teradata Database . . . . 14-7
  Reading a Teradata Database . . . . 14-8
Stage Page . . . . 14-8
  Advanced Tab . . . . 14-8
  NLS Map . . . . 14-9
Inputs Page . . . . 14-10
  Input Link Properties Tab . . . . 14-10
  Partitioning Tab . . . . 14-14
Outputs Page . . . . 14-16
  Output Link Properties Tab . . . . 14-17

Chapter 15

Informix Enterprise Stage


Accessing Informix Databases . . . . 15-2
  Considerations for Using the High Performance Loader (HPL) . . . . 15-2
  Using Informix XPS Stages on AIX Systems . . . . 15-5
  Type Conversions - Writing to Informix . . . . 15-6
  Type Conversions - Reading from Informix . . . . 15-6
Must Do's . . . . 15-7
  Writing an Informix Database . . . . 15-8
  Reading an Informix Database . . . . 15-8
Stage Page . . . . 15-9
  Advanced Tab . . . . 15-9
Inputs Page . . . . 15-10
  Input Link Properties Tab . . . . 15-10
  Partitioning Tab . . . . 15-13
Outputs Page . . . . 15-16
  Output Link Properties Tab . . . . 15-16

Chapter 16

Transformer Stage
Must Do's . . . . 16-2
Transformer Editor Components . . . . 16-3
  Toolbar . . . . 16-3
  Link Area . . . . 16-3
  Meta Data Area . . . . 16-4
  Shortcut Menus . . . . 16-4


Transformer Stage Basic Concepts . . . . 16-5
  Input Link . . . . 16-5
  Output Links . . . . 16-5
Editing Transformer Stages . . . . 16-7
  Using Drag and Drop . . . . 16-7
  Find and Replace Facilities . . . . 16-8
  Select Facilities . . . . 16-9
  Creating and Deleting Columns . . . . 16-9
  Moving Columns Within a Link . . . . 16-10
  Editing Column Meta Data . . . . 16-10
  Defining Output Column Derivations . . . . 16-10
  Editing Multiple Derivations . . . . 16-13
  Handling Null Values in Input Columns . . . . 16-16
  Defining Constraints and Handling Otherwise Links . . . . 16-16
  Specifying Link Order . . . . 16-18
  Defining Local Stage Variables . . . . 16-18
The DataStage Expression Editor . . . . 16-21
  Expression Format . . . . 16-21
  Entering Expressions . . . . 16-22
  Completing Variable Names . . . . 16-23
  Validating the Expression . . . . 16-23
  Exiting the Expression Editor . . . . 16-24
  Configuring the Expression Editor . . . . 16-24
  System Variables . . . . 16-24
  Guide to Using Transformer Expressions and Stage Variables . . . . 16-24
Transformer Stage Properties . . . . 16-27
  Stage Page . . . . 16-27
  Inputs Page . . . . 16-32
  Outputs Page . . . . 16-34

Chapter 17

BASIC Transformer Stages


Must Do's . . . . 17-2
BASIC Transformer Editor Components . . . . 17-3
  Toolbar . . . . 17-3
  Link Area . . . . 17-3
  Meta Data Area . . . . 17-4
  Shortcut Menus . . . . 17-4


BASIC Transformer Stage Basic Concepts . . . . 17-5
  Input Link . . . . 17-5
  Output Links . . . . 17-5
  Before-Stage and After-Stage Routines . . . . 17-6
Editing BASIC Transformer Stages . . . . 17-7
  Using Drag and Drop . . . . 17-7
  Find and Replace Facilities . . . . 17-8
  Select Facilities . . . . 17-9
  Creating and Deleting Columns . . . . 17-10
  Moving Columns Within a Link . . . . 17-10
  Editing Column Meta Data . . . . 17-10
  Defining Output Column Derivations . . . . 17-10
  Editing Multiple Derivations . . . . 17-13
  Specifying Before-Stage and After-Stage Subroutines . . . . 17-16
  Defining Constraints and Handling Reject Links . . . . 17-17
  Specifying Link Order . . . . 17-19
  Defining Local Stage Variables . . . . 17-20
The DataStage Expression Editor . . . . 17-22
  Expression Format . . . . 17-23
  Entering Expressions . . . . 17-24
  Completing Variable Names . . . . 17-25
  Validating the Expression . . . . 17-25
  Exiting the Expression Editor . . . . 17-25
  Configuring the Expression Editor . . . . 17-26
BASIC Transformer Stage Properties . . . . 17-26
  Stage Page . . . . 17-26
  Inputs Page . . . . 17-27
  Outputs Page . . . . 17-30

Chapter 18

Aggregator Stage
Example . . . . 18-2
Must Do's . . . . 18-5
Stage Page . . . . 18-6
  Properties Tab . . . . 18-6
  Advanced Tab . . . . 18-13
  NLS Locale Tab . . . . 18-14


Inputs Page . . . . 18-15
  Partitioning Tab . . . . 18-15
Outputs Page . . . . 18-18
  Mapping Tab . . . . 18-18

Chapter 19

Join Stage
Join Versus Lookup . . . . 19-2
Example Joins . . . . 19-3
  Inner Join . . . . 19-4
  Left Outer Join . . . . 19-4
  Right Outer Join . . . . 19-5
  Full Outer Join . . . . 19-5
Must Do's . . . . 19-6
Stage Page . . . . 19-6
  Properties Tab . . . . 19-7
  Advanced Tab . . . . 19-8
  Link Ordering Tab . . . . 19-9
  NLS Locale Tab . . . . 19-9
Inputs Page . . . . 19-10
  Partitioning on Input Links . . . . 19-10
Outputs Page . . . . 19-13
  Mapping Tab . . . . 19-13

Chapter 20

Merge Stage
Example Merge . . . . 20-3
Must Do's . . . . 20-4
Stage Page . . . . 20-5
  Properties Tab . . . . 20-5
  Advanced Tab . . . . 20-7
  Link Ordering Tab . . . . 20-7
  NLS Locale Tab . . . . 20-8
Inputs Page . . . . 20-9
  Partitioning Tab . . . . 20-9
Outputs Page . . . . 20-12
  Reject Links . . . . 20-12
  Mapping Tab . . . . 20-13


Chapter 21

Lookup Stage
Lookup Versus Join . . . . 21-5
Example Look Up . . . . 21-5
Must Do's . . . . 21-7
  Using In-Memory Lookup Tables . . . . 21-8
  Using Oracle or DB2 Databases Directly . . . . 21-9
  Using Lookup Fileset . . . . 21-10
Lookup Editor Components . . . . 21-11
  Toolbar . . . . 21-11
  Link Area . . . . 21-11
  Meta Data Area . . . . 21-12
  Shortcut Menus . . . . 21-12
Editing Lookup Stages . . . . 21-13
  Using Drag and Drop . . . . 21-13
  Find and Replace Facilities . . . . 21-14
  Select Facilities . . . . 21-15
  Creating and Deleting Columns . . . . 21-16
  Moving Columns Within a Link . . . . 21-16
  Editing Column Meta Data . . . . 21-16
  Defining Output Column Derivations . . . . 21-16
  Defining Input Column Key Expressions . . . . 21-19
Lookup Stage Properties . . . . 21-20
  Stage Page . . . . 21-20
  Inputs Page . . . . 21-24
  Outputs Page . . . . 21-26
Lookup Stage Conditions . . . . 21-27
The DataStage Expression Editor . . . . 21-29
  Expression Format . . . . 21-29
  Entering Expressions . . . . 21-30
  Completing Variable Names . . . . 21-31
  Validating the Expression . . . . 21-31
  Exiting the Expression Editor . . . . 21-31
  Configuring the Expression Editor . . . . 21-32


Chapter 22

Funnel Stage
Examples . . . . 22-2
  Continuous Funnel Example . . . . 22-2
  Sort Funnel Example . . . . 22-4
  Sequence Funnel Example . . . . 22-6
Must Do's . . . . 22-7
Stage Page . . . . 22-8
  Properties Tab . . . . 22-8
  Advanced Tab . . . . 22-9
  Link Ordering Tab . . . . 22-11
  NLS Locale Tab . . . . 22-11
Inputs Page . . . . 22-12
  Partitioning on Input Links . . . . 22-12
Outputs Page . . . . 22-15
  Mapping Tab . . . . 22-15

Chapter 23

Sort Stage
Examples . . . . 23-3
  Sequential Sort . . . . 23-3
  Parallel Sort . . . . 23-6
  Total Sort . . . . 23-8
Must Do's . . . . 23-9
Stage Page . . . . 23-10
  Properties Tab . . . . 23-10
  Advanced Tab . . . . 23-14
  NLS Locale Tab . . . . 23-14
Inputs Page . . . . 23-15
  Partitioning Tab . . . . 23-15
Outputs Page . . . . 23-18
  Mapping Tab . . . . 23-18

Chapter 24

Remove Duplicates Stage


Example . . . . 24-2
Must Do's . . . . 24-4


Stage Page . . . . 24-5
  Properties Tab . . . . 24-5
  Advanced Tab . . . . 24-6
  NLS Locale Tab . . . . 24-7
Inputs Page . . . . 24-7
  Partitioning on Input Links . . . . 24-8
Output Page . . . . 24-10
  Mapping Tab . . . . 24-10

Chapter 25

Compress Stage
Must Do's . . . . 25-2
Stage Page . . . . 25-2
  Properties Tab . . . . 25-2
  Advanced Tab . . . . 25-3
Input Page . . . . 25-3
  Partitioning on Input Links . . . . 25-4
Output Page . . . . 25-6

Chapter 26

Expand Stage
Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Input Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partitioning on Input Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Output Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26-2 26-2 26-2 26-3 26-3 26-4 26-4

Chapter 27

Copy Stage
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Input Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partitioning on Input Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27-2 27-6 27-6 27-6 27-6 27-7 27-8

xx

Book Title

Contents

Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27-10 Mapping Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27-10

Chapter 28

Modify Stage
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-2 Dropping and Keeping Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-2 Changing Data Type. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-3 Null Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-4 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-4 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-5 Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-5 Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-13 Input Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-14 Partitioning on Input Links. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-14 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28-16

Chapter 29

Filter Stage
Specifying the Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-2 Input Data Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-2 Supported Boolean Expressions and Operators . . . . . . . . . . . . . . . . . . . . 29-3 String Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-3 Examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-4 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-5 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-5 Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-6 Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-7 Link Ordering Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-8 NLS Locale Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-8 Input Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-9 Partitioning on Input Links. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-9 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-12 Mapping Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-12

Book Title

xxi

Contents

Chapter 30

External Filter Stage


Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Input Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partitioning on Input Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30-2 30-2 30-2 30-3 30-3 30-4 30-6

Chapter 31

Change Capture Stage


Example Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-2 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-3 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-4 Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-4 Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-8 Link Ordering Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-9 NLS Locale Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-9 Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-10 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-10 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-13 Mapping Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31-13

Chapter 32

Change Apply Stage


Example Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-3 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-5 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-5 Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-5 Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-8 Link Ordering Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-9 NLS Locale Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-10 Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-10 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-11 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-13 Mapping Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-13

xxii

Book Title

Contents

Chapter 33

Difference Stage
Example Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-2 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-3 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-4 Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-4 Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-7 Link Ordering Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-8 NLS Locale Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-9 Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-9 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-10 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-12 Mapping Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-13

Chapter 34

Compare Stage
Example Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34-2 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34-3 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34-4 Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34-4 Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34-6 Link Ordering Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34-7 NLS Locale Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34-7 Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34-8 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34-8 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34-11

Chapter 35

Encode Stage
Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35-2 35-2 35-2 35-3 35-3 35-4 35-6

Book Title

xxiii

Contents

Chapter 36

Decode Stage
Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36-1 36-2 36-2 36-2 36-3 36-4 36-4

Chapter 37

Switch Stage
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-2 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-3 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-4 Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-4 Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-7 Link Ordering Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-8 NLS Locale Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-8 Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-9 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-9 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-12 Mapping Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-12 Reject Link. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-13

Chapter 38

SAS Stage
Example Job. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-2 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-5 Using the SAS Stage on NLS Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-5 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-6 Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-6 Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-10 Link Ordering Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-11 NLS Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-11 Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-12 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-12

xxiv

Book Title

Contents

Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-15 Mapping Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38-15

Chapter 39

Generic Stage
Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Link Ordering Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39-2 39-2 39-2 39-3 39-4 39-4 39-4 39-7

Chapter 40

Surrogate Key Stage


Key Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40-2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40-4 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40-6 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40-7 Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40-7 Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40-8 Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40-8 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40-9 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40-11 Mapping Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40-12

Chapter 41

Column Import Stage


Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41-2 41-5 41-6 41-6 41-8 41-9 41-9

Book Title

xxv

Contents

Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Format Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mapping Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reject Link. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Using RCP With Column Import Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . .

41-11 41-12 41-23 41-24 41-24

Chapter 42

Column Export Stage


Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-2 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-4 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-5 Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-5 Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-7 Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-7 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-8 Format Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-10 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-22 Mapping Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-23 Reject Link. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-23 Using RCP With Column Export Stages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42-24

Chapter 43

Make Subrecord Stage


Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43-3 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43-5 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43-6 Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43-6 Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43-7 Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43-8 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43-8 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43-10

Chapter 44

Split Subrecord Stage


Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44-2 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44-5

xxvi

Book Title

Contents

Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44-6 Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44-6 Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44-6 Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44-7 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44-7 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44-10

Chapter 45

Combine Records Stage


Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-3 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-5 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-7 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-8 Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-8 Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-10 NLS Locale Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-10 Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-11 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-11 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45-14

Chapter 46

Promote Subrecord Stage


Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46-2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46-3 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46-5 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46-7 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46-7 Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46-7 Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46-8 Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46-8 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46-9 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46-11

Book Title

xxvii

Contents

Chapter 47

Make Vector Stage


Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47-2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47-2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47-5 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47-7 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47-7 Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47-7 Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47-8 Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47-8 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47-9 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47-11

Chapter 48

Split Vector Stage


Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48-2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48-2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48-4 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48-6 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48-6 Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48-6 Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48-7 Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48-7 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48-8 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48-10

Chapter 49

Head Stage
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Head Stage Default Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Skipping Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49-2 49-2 49-4 49-4 49-5 49-5 49-7 49-7 49-8

xxviii

Book Title

Contents

Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49-10 Mapping Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49-11

Chapter 50

Tail Stage
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mapping Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50-2 50-3 50-4 50-4 50-5 50-6 50-6 50-8 50-9

Chapter 51

Sample Stage
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-2 Sampling in Percent Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-2 Sampling in Period Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-6 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-7 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-8 Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-8 Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-10 Link Ordering Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-11 Input Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-11 Partitioning on Input Links. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-11 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-14 Mapping Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51-14

Chapter 52

Peek Stage
Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Link Ordering Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52-2 52-2 52-2 52-5 52-6

Book Title

xxix

Contents

Inputs Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mapping Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

52-6 52-6 52-9 52-9

Chapter 53

Row Generator Stage


Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Using a Row Generator Stage in Default Mode . . . . . . . . . . . . . . . . . . . . Example of Specifying Data to be Generated . . . . . . . . . . . . . . . . . . . . . . Example of Generating Data in Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53-2 53-2 53-3 53-6 53-7 53-8 53-8 53-9 53-9

Chapter 54

Column Generator Stage


Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54-1 Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54-5 Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54-6 Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54-6 Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54-7 Input Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54-8 Partitioning on Input Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54-8 Outputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54-11 Mapping Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54-11

Chapter 55

Write Range Map Stage


Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Must Dos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stage Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Advanced Tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . NLS Locale Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55-2 55-3 55-4 55-4 55-5

xxx

Book Title

Contents

Inputs Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55-5 Input Link Properties Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55-6 Partitioning Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55-6

Chapter 56

Parallel Jobs on USS


Set Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56-1 Deployment Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56-2 Deploy Under Control of DataStage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56-2 Deploy Standalone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56-5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56-6 Directory Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56-6 Generated Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56-7 Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56-8 Running Jobs on the USS Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56-8 Deploying and Running from DataStage . . . . . . . . . . . . . . . . . . . . . . . . . . 56-8 Deploying from DataStage, Running Manually . . . . . . . . . . . . . . . . . . . . . 56-9 Deploying and Running Manually . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56-11

Chapter 57

Managing Data Sets


Structure of Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Starting the Data Set Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Data Set Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Viewing the Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Viewing the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Copying Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Deleting Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57-1 57-2 57-4 57-5 57-5 57-6 57-6

Chapter 58

The Parallel Engine Configuration File


Configurations Editor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-1

Book Title

xxxi

Contents

Configuration Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-3 Logical Processing Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-4 Optimizing Parallelism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-4 Configuration Options for an SMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-6 Example Configuration File for an SMP . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-8 Configuration Options for an MPP System . . . . . . . . . . . . . . . . . . . . . . . . 58-9 An Example of a Four-Node MPP System Configuration . . . . . . . . . . . . 58-10 Configuration Options for an SMP Cluster . . . . . . . . . . . . . . . . . . . . . . . 58-11 An Example of an SMP Cluster Configuration. . . . . . . . . . . . . . . . . . . . . 58-12 Options for a Cluster with the Conductor Unconnected to the High-Speed Switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-13 Diagram of a Cluster Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-15 Configuration Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-15 The Default Path Name and the APT_CONFIG_FILE . . . . . . . . . . . . . . . . 58-16 Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-16 Node Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-17 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-18 Node Pools and the Default Node Pool . . . . . . . . . . . . . . . . . . . . . . . . . . 58-22 Disk and Scratch Disk Pools and Their Defaults . . . . . . . . . . . . . . . . . . . 58-23 Buffer Scratch Disk Pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-24 The resource DB2 Option . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-25 The resource INFORMIX Option . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-26 The resource ORACLE option . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-27 The SAS Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-28 Adding SAS Information to your Configuration File. . . . . . . . . . . . . . . . 58-28 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-29 Sort Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-29 Allocation of Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-30 Selective Configuration with Startup Scripts . . . . . . . . . . . . . . . . . . . . . . . . 58-30 Hints and Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58-32

Chapter 59

SQL Builder
How to Use the SQL Builder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-1 How to Build Queries with the SQL Builder . . . . . . . . . . . . . . . . . . . . . . . . . . 59-2

xxxii

Book Title

Contents

Selection Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-5 Toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-5 Repository Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-6 Table Selection Canvas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-7 Column Selection Grid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-8 Filter Panel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-9 Filter Expression Panel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-10 Group Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-10 Grouping Grid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-10 Filter Panel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-12 Filter Expression Panel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-12 Sql Tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-12 Resolve Columns Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-12 Expression Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-14 Main Expression Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-14 Calculation/Function Expression Editor . . . . . . . . . . . . . . . . . . . . . . . . . . 59-20 Expression Editor Menus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-21 Joining Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-23 Specifying Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-25 Join Properties Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-26 Alternate Relation Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-27 Properties Dialogs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-28 Table Properties Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-28 SQL Properties Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-29 Example Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-29 Example Simple Select Query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-29 Example Inner Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-32 Example Aggregate Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59-34

Chapter 60

Remote Deployment
Enabling a Project for Job Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60-2

Book Title

xxxiii

Contents

Deployment Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Command Shell Script pxrun.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Environment Variable Setting Source Script evdepfile . . . . . . . . . . . . . Main Parallel (OSH) Program Script OshScript.osh . . . . . . . . . . . . . . . . Script Parameter File jpdepfile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XML Report File <jobname>.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Compiled Transformer Binary Files <jobnamestagename>.trx.so . . . . Self-Contained Transformer Compilation . . . . . . . . . . . . . . . . . . . . . . . . . Deploying a Job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Server Side Plug-Ins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

60-4 60-5 60-5 60-5 60-5 60-5 60-5 60-6 60-6 60-7

Appendix A

Schemas
Schema Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Date Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Decimal Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Floating-Point Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Integer Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Raw Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . String Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Time Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Timestamp Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Subrecords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tagged Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partial Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1 A-3 A-3 A-4 A-4 A-4 A-4 A-5 A-5 A-5 A-6 A-7 A-7

Appendix B

Functions
Date and Time Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1 Logical Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-4 Mathematical Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-5 Null Handling Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-7 Number Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-7 Raw Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-8 String Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-8 Vector Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-11 Type Conversion Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-11

xxxiv

Book Title

Contents

Type Casting Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-14 Utility Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-14

Appendix C

Fillers
Creating Fillers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1 Filler Creation Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2 Filler Creation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-3 Expanding Fillers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-16


1
Introduction
This chapter gives an overview of parallel jobs. Parallel jobs are developed using the DataStage Designer and compiled and run on the DataStage server. Such jobs commonly connect to a data source, extract and transform data, and write it to a data warehouse. DataStage allows you to concentrate on designing your job sequentially, without worrying too much about how parallel processing will be implemented. You specify the logic of the job; DataStage determines the best implementation on the available hardware. If required, however, you can exert more exact control over job implementation. Once designed, parallel jobs can run on SMP, MPP, or cluster systems. Jobs are scalable: the more processors you have, the faster the job will run. Parallel jobs can also be run on a USS system; special instructions for this are given in Chapter 56.
Note In the DataStage Administrator you must choose whether to run parallel jobs on standard UNIX systems or on USS systems. You cannot run both types of job at the same time. See the DataStage Administrator Guide.

DataStage also supports server jobs and mainframe jobs. Server jobs are compiled and run on the server. These are for use on non-parallel systems and SMP systems with up to 64 processors. Server jobs are described in the Server Job Developer's Guide. Mainframe jobs are available if you have Enterprise MVS Edition installed. These are loaded onto a mainframe and compiled and run there. Mainframe jobs are described in DataStage Enterprise MVS Edition: Mainframe Job Developer's Guide.


DataStage Parallel Jobs


DataStage jobs consist of individual stages. Each stage describes a particular process; this may be accessing a database or transforming data in some way. For example, one stage may extract data from a data source, while another transforms it. Stages are added to a job and linked together using the Designer. The following diagram represents one of the simplest jobs you could have: a data source, a Transformer (conversion) stage, and the final database. The links between the stages represent the flow of data into or out of a stage. In a parallel job each stage would correspond to a process. You can have multiple instances of each process to run on the available processors in your system.

(Diagram: Data Source -> Transformer Stage -> Data Warehouse)

You must specify the data you want at each stage, and how it is handled. For example, do you want all the columns in the source data, or only a select few? Are you going to rename any of the columns? How are they going to be transformed? You lay down these stages and links on the canvas of the DataStage Designer. You specify the design as if it were sequential; DataStage determines how the stages will become processes and how many instances of these will actually be run. DataStage also allows you to store reusable components in the DataStage Repository which can be incorporated into different job designs. You can import these components, or entire jobs, from other DataStage Projects using the DataStage Manager. You can also import meta data directly from data sources and data targets. Guidance on how to construct your job and define the required meta data using the DataStage Designer and the DataStage Manager is in the DataStage Designer Guide and DataStage Manager Guide. Chapter 4 onwards of this manual describes the individual stage editors that you may use when developing parallel jobs.


2
Designing Parallel Jobs
The DataStage Parallel Extender brings the power of parallel processing to your data extraction and transformation applications. This chapter gives a basic introduction to parallel processing, and describes some of the key concepts in designing parallel jobs for DataStage. If you are new to DataStage, you should read the introductory chapters of the DataStage Designer Guide first so that you are familiar with the DataStage Designer interface and the way jobs are built from stages and links.

Parallel Processing
There are two basic types of parallel processing: pipeline and partitioning. DataStage allows you to use both of these methods. The following sections illustrate these methods using a simple DataStage job which extracts data from a data source, transforms it in some way, then writes it to another data source. In all cases this job would appear the same on your Designer canvas, but you can configure it to behave in different ways (which are shown diagrammatically).

Pipeline Parallelism
If you ran the example job on a system with at least three processors, the stage reading the data source would start on one processor and start filling a pipeline with the data it had read. The transformer stage would start running on another processor as soon as there was data in the pipeline, process it, and start filling another pipeline. The stage writing the transformed data to the target database would similarly start writing as soon as there was data available. Thus all three stages are operating simultaneously. If you were running sequentially, there would be only one instance of each stage. If you were running in parallel, there would be as many instances as you had partitions (see next section).
Conceptual representation of job running with no parallelism

Conceptual representation of same job using pipeline parallelism
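The following is not DataStage code; it is a minimal, hypothetical Python sketch of the pipeline idea, in which three stand-in stages run at the same time and each starts work as soon as the stage before it has produced a row.

    import threading
    import queue

    SENTINEL = None          # marks the end of the data in a pipeline

    def read_stage(out_q):
        # Stand-in for the stage reading the data source.
        for row in range(10):
            out_q.put(row)
        out_q.put(SENTINEL)

    def transform_stage(in_q, out_q):
        # Stand-in for the Transformer stage: handles rows as they arrive.
        while True:
            row = in_q.get()
            if row is SENTINEL:
                break
            out_q.put(row * 2)
        out_q.put(SENTINEL)

    def write_stage(in_q):
        # Stand-in for the stage writing to the target database.
        while True:
            row = in_q.get()
            if row is SENTINEL:
                break
            print("written:", row)

    q1, q2 = queue.Queue(), queue.Queue()
    stages = [threading.Thread(target=read_stage, args=(q1,)),
              threading.Thread(target=transform_stage, args=(q1, q2)),
              threading.Thread(target=write_stage, args=(q2,))]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()

All three stand-in stages are active at once; none of them waits for the whole of the previous stage's output before starting, which is the essence of pipeline parallelism.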

Partition Parallelism
Imagine you have the same simple job as described above, but that it is handling very large quantities of data. In this scenario you could use the power of parallel processing to your best advantage by partitioning the data into a number of separate sets, with each partition being handled by a separate instance of the job stages. Using partition parallelism the same job would effectively be run simultaneously by several processors, each handling a separate subset of the total data. At the end of the job the data partitions can be collected back together again and written to a single data source.

Conceptual representation of job using partition parallelism

Combining Pipeline and Partition Parallelism


In practice you will be combining pipeline and partition parallel processing to achieve even greater performance gains. In this scenario you would have stages processing partitioned data and filling pipelines so the next one could start on that partition before the previous one had finished.

Conceptual representation of job using pipeline and partition parallelism

Repartitioning Data
In some circumstances you may want to actually repartition your data between stages. This might happen, for example, where you want to group data differently. Say you have initially processed data based on customer last name, but now want to process data grouped by zip code. You will need to repartition to ensure that all customers sharing the same zip code are in the same group. DataStage allows you to repartition between stages as and when needed (although note that there are performance implications if you do this, and you may affect the balance of your partitions; see "Identifying Superfluous Repartitions" in the Parallel Job Advanced Developer's Guide).

Conceptual representation of data repartitioning

Further details about how DataStage actually partitions data, and collects it together again, are given in "Partitioning, Repartitioning, and Collecting Data" later in this chapter.
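As a rough, hypothetical illustration of the zip code example above (again in Python rather than anything DataStage-specific), key-based partitioning sends every record with a given key value to the same partition, and repartitioning simply redistributes the same records on a different key; the record layout and column names here are invented.

    records = [
        {"last_name": "Smith", "zip": "02139"},
        {"last_name": "Jones", "zip": "02139"},
        {"last_name": "Smith", "zip": "94304"},
        {"last_name": "Brown", "zip": "94304"},
    ]

    def partition(rows, key, n_partitions):
        # Hash partitioning: rows with equal key values always land in
        # the same partition.
        parts = [[] for _ in range(n_partitions)]
        for row in rows:
            parts[hash(row[key]) % n_partitions].append(row)
        return parts

    by_name = partition(records, "last_name", 3)      # initial grouping
    regathered = [row for part in by_name for row in part]
    by_zip = partition(regathered, "zip", 3)          # repartitioned by zip code

    for i, part in enumerate(by_zip):
        print("partition", i, part)

After the second call every customer sharing a zip code is guaranteed to be in the same partition, which is what the repartition in the job has to achieve.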

Parallel Processing Environments


The environment in which you run your DataStage jobs is defined by your system's architecture and hardware resources. All parallel processing environments are categorized as one of:

- SMP (symmetric multiprocessing), in which some hardware resources may be shared among processors. The processors communicate via shared memory and have a single operating system.
- Cluster or MPP (massively parallel processing), also known as shared-nothing, in which each processor has exclusive access to hardware resources. MPP systems are physically housed in the same box, whereas cluster systems can be physically dispersed. The processors each have their own operating system, and communicate via a high-speed network.

SMP systems allow you to scale up the number of processors, which may improve performance of your jobs. The improvement gained depends on how your job is limited:


- CPU-limited jobs. In these jobs the memory, memory bus, and disk I/O spend a disproportionate amount of time waiting for the processor to finish its work. Running a CPU-limited application on more processors can shorten this waiting time and so speed up overall performance.
- Memory-limited jobs. In these jobs CPU and disk I/O wait for the memory or the memory bus. SMP systems share memory resources, so it may be harder to improve performance on SMP systems without a hardware upgrade.
- Disk I/O limited jobs. In these jobs CPU, memory, and memory bus wait for disk I/O operations to complete. Some SMP systems allow scalability of disk I/O, so that throughput improves as the number of processors increases. A number of factors contribute to the I/O scalability of an SMP, including the number of disk spindles, the presence or absence of RAID, and the number of I/O controllers.

In a cluster or MPP environment, you can use the multiple processors and their associated memory and disk resources in concert to tackle a single job. In this environment, each processor has its own dedicated memory, memory bus, disk, and disk access. In a shared-nothing environment, parallelization of your job is likely to improve the performance of CPU-limited, memory-limited, or disk I/O-limited applications.

The Configuration File


One of the great strengths of the DataStage Enterprise Edition is that, when designing jobs, you don't have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities. If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don't necessarily have to change your job design.

DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs.

The configuration file describes available processing power in terms of processing nodes. These may, or may not, correspond to the actual number of processors in your system. You may, for example, want to always leave a couple of processors free to deal with other activities on your system. The number of nodes you define in the configuration file determines how many instances of a process will be produced when you compile a parallel job.


Every MPP, cluster, or SMP environment has characteristics that define the system overall as well as the individual processors. These characteristics include node names, disk storage locations, and other distinguishing attributes. For example, certain processors might have a direct connection to a mainframe for performing high-speed data transfers, while others have access to a tape drive, and still others are dedicated to running an RDBMS application. You can use the configuration file to set up node pools and resource pools. A pool defines a group of related nodes or resources, and when you design a DataStage job you can specify that execution be confined to a particular pool.

The configuration file describes every processing node that DataStage will use to run your application. When you run a DataStage job, DataStage first reads the configuration file to determine the available system resources. When you modify your system by adding or removing processing nodes or by reconfiguring nodes, you do not need to alter or even recompile your DataStage job. Just edit the configuration file.

The configuration file also gives you control over parallelization of your job during the development cycle. For example, by editing the configuration file, you can first run your job on a single processing node, then on two nodes, then four, then eight, and so on. The configuration file lets you measure system performance and scalability without actually modifying your job.

You can define and edit the configuration file using the DataStage Manager. This is described in the DataStage Manager Guide, which also gives detailed information on how you might set up the file for different systems. This information is also given in Chapter 58 of this manual.

Partitioning, Repartitioning, and Collecting Data


We have already described how you can use partitioning of data to implement parallel processing in your job (see "Partition Parallelism" on page 2-3). This section takes a closer look at how you can partition data in your jobs, and collect it together again.

Partitioning
In the simplest scenario you probably won't be bothered how your data is partitioned. It is enough that it is partitioned and that the job runs faster. In these circumstances you can safely delegate responsibility for partitioning to DataStage. Once you have identified where you want to partition data, DataStage will work out the best method for doing it and implement it.

The aim of most partitioning operations is to end up with a set of partitions that are as near equal size as possible, ensuring an even load across your processors.

When performing some operations, however, you will need to take control of partitioning to ensure that you get consistent results. A good example of this would be where you are using an aggregator stage to summarize your data. To get the answers you want (and need) you must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition. DataStage lets you do this.

There are a number of different partitioning methods available. Note that all these descriptions assume you are starting with sequential data. If you are repartitioning already partitioned data then there are some specific considerations (see "Repartitioning" on page 2-22):

Round robin
The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when DataStage initially partitions data.
Round Robin Partitioning: the 16 input records are dealt to four nodes in turn, so Node 1 receives records 1, 5, 9, 13; Node 2 receives 2, 6, 10, 14; Node 3 receives 3, 7, 11, 15; and Node 4 receives 4, 8, 12, 16.
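As a purely illustrative sketch (this is hypothetical Python, not DataStage code, and the function name is invented), round robin assignment amounts to dealing records out by their position:

    def round_robin_partition(records, num_nodes):
        # Deal each record to a node in strict rotation: record i goes to node i mod num_nodes.
        partitions = [[] for _ in range(num_nodes)]
        for i, record in enumerate(records):
            partitions[i % num_nodes].append(record)
        return partitions

    # Sixteen records over four nodes: node 0 gets 1, 5, 9, 13, matching the figure above.
    print(round_robin_partition(list(range(1, 17)), 4))

Because assignment depends only on record position, the resulting partitions always differ in size by at most one record, whatever the key values are.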

Random
Records are randomly distributed across all processing nodes. Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition. The random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.
Random Partitioning: the 16 input records are distributed across the four nodes in no particular order, but each node receives approximately the same number of records.

Same
The operator using the data set as input performs no repartitioning and takes as input the partitions output by the preceding stage. With this partitioning method, records stay on the same processing node; that is, they are not redistributed. Same is the fastest partitioning method. This is normally the method DataStage uses when passing data between stages in your job.
Same Partitioning: each node keeps the records it already holds (Node 1 keeps 1, 5, 9, 13; Node 2 keeps 2, 6, 10, 14; and so on), so no data is moved between nodes.

Entire
Every instance of a stage on every processing node receives the complete data set as input. It is useful when you want the benefits of parallel execution, but every instance of the operator needs access to the entire input data set. You are most likely to use this partitioning method with stages that create lookup tables from their input.
Entire Partitioning: every node receives a complete copy of the input data set (records 1 to 8 go to Node 1, Node 2, Node 3, and Node 4 alike).

Hash by field
Partitioning is based on a function of one or more columns (the hash partitioning keys) in each record. The hash partitioner examines one or more fields of each input record (the hash key fields). Records with the same values for all hash key fields are assigned to the same processing node. This method is useful for ensuring that related records are in the same partition, which may be a prerequisite for a processing operation. For example, for a remove duplicates operation, you can hash partition records so that records with the same partitioning key values are on the same node. You can then sort the records on each node using the hash key fields as sorting key fields, then remove duplicates, again using the same keys. Although the data is distributed across partitions, the hash partitioner ensures that records with identical keys are in the same partition, allowing duplicates to be found.


Hash partitioning does not necessarily result in an even distribution of data between partitions. For example, if you hash partition a data set based on a zip code field, where a large percentage of your records are from one or two zip codes, you can end up with a few partitions containing most of your records. This behavior can lead to bottlenecks because some nodes are required to process more records than other nodes. For example, the diagram shows the possible results of hash partitioning a data set using the field age as the partitioning key. Each record with a given age is assigned to the same partition, so for example records with age 36, 40, or 22 are assigned to partition 0. The height of each bar represents the number of records in the partition.
Hash partitioning example: a bar chart of partition size (in records) against partition number, with each bar labelled by the age values assigned to that partition (for example, ages 36, 40, and 22 fall in partition 0); the bars vary widely in height.

As you can see, the key values are randomly distributed among the different partitions. The partition sizes resulting from a hash partitioner are dependent on the distribution of records in the data set, so even though there are three keys per partition, the number of records per partition varies widely, because the distribution of ages in the population is non-uniform.

When hash partitioning, you should select hashing keys that create a large number of partitions. For example, hashing by the first two digits of a zip code produces a maximum of 100 partitions. This is not a large number for a parallel processing system. Instead, you could hash by five digits of the zip code to create up to 10,000 partitions. You could also combine a zip code hash with an age hash (assuming a maximum age of 150) to yield 1,500,000 possible partitions. Fields that can only assume two values, such as yes/no, true/false, male/female, are particularly poor choices as hash keys.

You specify which columns are to act as hash keys on the Partitioning tab of the stage editor (see "Partitioning Tab" on page 3-20). The data type of a partitioning key may be any data type except raw, subrecord, tagged aggregate, or vector (see page 2-28 for data types). By default, the hash partitioner does case-sensitive comparison. This means that uppercase strings appear before lowercase strings in a partitioned data set. You can override this default if you want to perform case-insensitive partitioning on string fields.
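To make the grouping guarantee concrete, here is a small hypothetical Python sketch (illustrative only; it is not DataStage's hash function, and the record layout is invented): within a run, records with the same key value always land in the same partition, but nothing forces the partitions to be the same size.

    def hash_partition(records, key, num_nodes):
        # Records with identical key values hash to the same value,
        # so they are always assigned to the same partition.
        partitions = [[] for _ in range(num_nodes)]
        for record in records:
            partitions[hash(record[key]) % num_nodes].append(record)
        return partitions

    rows = [{"zip": "01930", "name": "Smith"},
            {"zip": "01930", "name": "Jones"},
            {"zip": "94304", "name": "Davis"}]
    # The two "01930" rows are guaranteed to end up together in one partition.
    print(hash_partition(rows, "zip", 4))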

Modulus
Partitioning is based on a key column modulo the number of partitions. This method is similar to hash by field, but involves simpler computation. In data mining, data is often arranged in buckets, that is, each record has a tag containing its bucket number. You can use the modulus partitioner to partition the records according to this number. The modulus partitioner assigns each record of an input data set to a partition of its output data set as determined by a specified key field in the input data set. This field can be the tag field. The partition number of each record is calculated as follows:
partition_number = fieldname mod number_of_partitions

where:

fieldname is a numeric field of the input data set.

number_of_partitions is the number of processing nodes on which the partitioner executes. If a partitioner is executed on three processing nodes it has three partitions.

In this example, the modulus partitioner partitions a data set containing ten records. Four processing nodes run the partitioner, and the modulus partitioner divides the data among four partitions. The bucket is specified as the key field, on which the modulus operation is calculated. Here is the input data set; each line represents a row, showing the bucket followed by the date:
64123  1960-03-30
61821  1960-06-27
44919  1961-06-18
22677  1960-09-24
90746  1961-09-15
21870  1960-01-01
87702  1960-12-22
4705   1961-12-13
47330  1961-03-21
88193  1962-03-12

The following table shows the output data set divided among four partitions by the modulus partitioner.
Partition 0: no records
Partition 1: 61821 1960-06-27; 22677 1960-09-24; 4705 1961-12-13; 88193 1962-03-12
Partition 2: 21870 1960-01-01; 87702 1960-12-22; 47330 1961-03-21; 90746 1961-09-15
Partition 3: 64123 1960-03-30; 44919 1961-06-18

Here are three sample modulus operations corresponding to the values of three of the key fields:

22677 mod 4 = 1; the data is written to Partition 1.
47330 mod 4 = 2; the data is written to Partition 2.
64123 mod 4 = 3; the data is written to Partition 3.

None of the key fields can be divided evenly by 4, so no data is written to Partition 0. You define the key on the Partitioning tab (see "Partitioning Tab" on page 3-20).
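The same calculation can be reproduced with a few lines of Python. This is a hypothetical sketch of the rule above, not DataStage code:

    def modulus_partition(records, num_partitions):
        # Assign each (bucket, date) record to partition number bucket mod num_partitions.
        partitions = [[] for _ in range(num_partitions)]
        for bucket, date in records:
            partitions[bucket % num_partitions].append((bucket, date))
        return partitions

    data = [(64123, "1960-03-30"), (61821, "1960-06-27"), (44919, "1961-06-18"),
            (22677, "1960-09-24"), (90746, "1961-09-15"), (21870, "1960-01-01"),
            (87702, "1960-12-22"), (4705, "1961-12-13"), (47330, "1961-03-21"),
            (88193, "1962-03-12")]
    for number, contents in enumerate(modulus_partition(data, 4)):
        print("Partition", number, contents)   # Partition 0 stays empty for this data

Running the sketch reproduces the distribution shown in the table above.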

Range
Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range. This method is also useful for ensuring that related records are in the same partition. A range partitioner divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. In order to use a range partitioner, you have to make a range map. You can do this using the Write Range Map stage, which is described in Chapter 55. The range partitioner guarantees that all records with the same partitioning key values are assigned to the same partition and that the partitions are approximately equal in size so all nodes perform an equal amount of work when processing the data set. An example of the results of a range partition is shown below. The partitioning is based on the age key, and the age range for each partition is indicated by the numbers in each bar. The height of the bar shows the size of the partition.
Range partitioning example: a bar chart of partition size (in records) against partition, with each bar labelled by the range of age values it holds (0-2, 3-17, 18-25, 26-44, 66-71); the bars are all of similar height.
All partitions are of approximately the same size. In an ideal distribution, every partition would be exactly the same size. However, you typically observe small differences in partition size.

In order to size the partitions, the range partitioner uses a range map to calculate partition boundaries. As shown above, the distribution of partitioning keys is often not even; that is, some partitions contain many partitioning keys, and others contain relatively few. However, based on the calculated partition boundaries, the number of records in each partition is approximately the same.

Range partitioning is not the only partitioning method that guarantees equivalent-sized partitions. The random and round robin partitioning methods also guarantee that the partitions of a data set are equivalent in size. However, these partitioning methods are keyless; that is, they do not allow you to control how records of a data set are grouped together within a partition.

In order to perform range partitioning your job requires a Write Range Map stage to calculate the range partition boundaries, in addition to the stage that actually uses the range partitioner. The Write Range Map stage uses a probabilistic splitting technique to range partition a data set. This technique is described in "Parallel Sorting on a Shared-Nothing Architecture Using Probabilistic Splitting" by DeWitt, Naughton, and Schneider, in Query Processing in Parallel Relational Database Systems by Lu, Ooi, and Tan, IEEE Computer Society Press, 1994. In order for the stage to determine the partition boundaries, you pass it a sorted sample of the data set to be range partitioned. From this sample, the stage can determine the appropriate partition boundaries for the entire data set. See Chapter 55, "Write Range Map Stage," for details.

When you come to actually partition your data, you specify the range map to be used by clicking on the property icon next to the Partition type field; the Partitioning/Collection properties dialog box appears and allows you to specify a range map (see "Partitioning Tab" on page 3-20 for a description of the Partitioning tab).
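As an informal illustration of how a range map is applied (hypothetical Python, not the actual range partitioner), once boundary values have been derived from a sorted sample of the data, each record can be placed with a simple binary search:

    import bisect

    def range_partition(records, key, boundaries):
        # boundaries[i] is the highest key value assigned to partition i,
        # as a range map derived from a sorted sample might define it.
        partitions = [[] for _ in range(len(boundaries) + 1)]
        for record in records:
            partitions[bisect.bisect_left(boundaries, record[key])].append(record)
        return partitions

    boundaries = [2, 17, 25, 44]          # hypothetical age boundaries
    people = [{"age": a} for a in (1, 5, 16, 20, 30, 41, 67, 70)]
    print(range_partition(people, "age", boundaries))

Records with equal key values always fall in the same partition, and boundaries chosen from a representative sample keep the partition sizes roughly equal.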


DB2
Partitions an input data set in the same way that DB2 would partition it. For example, if you use this method to partition an input data set containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node. Any reads and writes of the DB2 table would entail no network activity. See the DB2 Parallel Edition for AIX, Administration Guide and Reference for more information on DB2 partitioning.

To use DB2 partitioning on a stage, select a Partition type of DB2 on the Partitioning tab, then click the Properties button to the right. In the Partitioning/Collection properties dialog box, specify the details of the DB2 table whose partitioning you want to replicate (see "Partitioning Tab" on page 3-20 for a description of the Partitioning tab).

Auto
The most common method you will see on the DataStage stages is Auto. This just means that you are leaving it to DataStage to determine the best partitioning method to use depending on the type of stage, and what the previous stage in the job has done. Typically DataStage would use round robin when initially partitioning data, and same for the intermediate stages of a job.

Collecting
Collecting is the process of joining the multiple partitions of a single data set back together again into a single partition. There are various situations where you may want to do this. There may be a stage in your job that you want to run sequentially rather than in parallel, in which case you will need to collect all your partitioned data at this stage to make sure it is operating on the whole data set. Similarly, at the end of a job, you may want to write all your data to a single database, in which case you need to collect it before you write it.

There may be other cases where you don't want to collect the data at all. For example, you may want to write each partition to a separate flat file.

Just as for partitioning, in many situations you can leave DataStage to work out the best collecting method to use. There are situations, however, where you will want to explicitly specify the collection method.

Note that collecting methods are mostly non-deterministic. That is, if you run the same job twice with the same data, you are unlikely to get data collected in the same order each time. If order matters, you need to use the sorted merge collection method.

The following methods are available:

Round robin
Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, starts over. After reaching the final record in any partition, skips that partition in the remaining rounds.
Round Robin Collector: Node 1 holds records 1-4, Node 2 holds 5-8, Node 3 holds 9-12, and Node 4 holds 13-16; reading one record from each node in turn produces the output sequence 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15, 4, 8, 12, 16.
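A hypothetical Python sketch of the same idea (illustrative only, not DataStage code) shows why exhausted partitions are simply skipped:

    def round_robin_collect(partitions):
        # Take one record from each partition in turn, dropping partitions as they run out.
        iterators = [iter(p) for p in partitions]
        collected = []
        while iterators:
            remaining = []
            for it in iterators:
                try:
                    collected.append(next(it))
                    remaining.append(it)
                except StopIteration:
                    pass            # this partition is finished; skip it in later rounds
            iterators = remaining
        return collected

    partitions = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(round_robin_collect(partitions))   # 1, 5, 9, 13, 2, 6, 10, 14, ...

If the partitions were of unequal length, the shorter ones would drop out of the rotation once empty and the remaining records would still be collected.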

Ordered
Reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves the order of totally sorted input data sets. In a totally sorted data set, both the records in each partition and the partitions themselves are ordered. This may be useful as a preprocessing action before exporting a sorted data set to a single data file.
Ordered Collector: all the records from Node 1 (1-4) are read first, then all the records from Node 2 (5-8), Node 3 (9-12), and Node 4 (13-16), producing the output sequence 1 through 16.

Sorted merge
Reads records in an order based on one or more columns of the record. The columns used to define record order are called collecting keys. Typically, you use the sorted merge collector with a partition-sorted data set (as created by a sort stage). In this case, you specify as the collecting key fields those fields you specified as sorting key fields to the sort stage. For example, the figure below shows the current record in each of three partitions of an input data set to the collector:
Current record in each partition: Partition 0 holds Jane Smith, 42; Partition 1 holds Paul Smith, 34; Partition 2 holds Mary Davis, 42.

In this example, the records consist of three fields. The first-name and last-name fields are strings, and the age field is an integer. The following figure shows the order of the three records read by the sort merge collector, based on different combinations of collecting keys.
Order read under different collecting keys: with the last-name field as the primary collecting key, Mary Davis is read first, followed by the two Smith records; the order of the two Smith records is then decided by the secondary collecting key (for example, first name or age).

You must define a single primary collecting key for the sort merge collector, and you may define as many secondary keys as are required by your job. Note, however, that each record field can be used only once as a collecting key. Therefore, the total number of primary and secondary collecting keys must be less than or equal to the total number of fields in the record. You define the keys on the Partitioning tab (see "Partitioning Tab" on page 3-20), and the key you define first is the primary key. The data type of a collecting key can be any type except raw, subrec, tagged, or vector (see page 2-28 for data types). By default, the sort merge collector uses ascending sort order and case-sensitive comparisons. Ascending order means that records with smaller values for a collecting field are processed before records with larger values. You also can specify descending sorting order, so records with larger values are processed first. With a case-sensitive algorithm, records with uppercase strings are processed before records with lowercase strings. You can override this default to perform case-insensitive comparisons of string fields.
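The standard library's merge function gives a rough, hypothetical picture of what the sorted merge collector does (this is illustrative Python, not DataStage code):

    import heapq
    from operator import itemgetter

    # Each partition is already sorted on the collecting keys,
    # as a sort stage would leave it.
    partition0 = [("Smith", "Jane", 42)]
    partition1 = [("Smith", "Paul", 34)]
    partition2 = [("Davis", "Mary", 42)]

    # Primary collecting key: last name; secondary collecting key: first name.
    key = itemgetter(0, 1)
    for record in heapq.merge(partition0, partition1, partition2, key=key):
        print(record)
    # Davis/Mary is read first, then Smith/Jane, then Smith/Paul.

Swapping the secondary key from first name to age would reverse the order of the two Smith records, which is exactly the effect of choosing different secondary collecting keys.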

Auto
The most common method you will see on the DataStage stages is Auto. This normally means that DataStage will eagerly read any row from any input partition as it becomes available, but if it detects that, for example, the data needs sorting as it is collected, it will do that. This is the fastest collecting method.

Repartitioning
If you decide you need to repartition data within your DataStage job, there are some particular considerations, as repartitioning can affect the balance of data partitions.


For example, if you start with four perfectly balanced partitions and then subsequently repartition into three partitions, you will lose the perfect balance and be left with, at best, near perfect balance. This is true even for the round robin method; this only produces perfectly balanced partitions from a sequential data source. The reason for this is illustrated below. Each node partitions as if it were a single processor with a single data set, and will always start writing to the first target partition. In the case of four partitions repartitioning to three, more rows are written to the first target partition. With a very small data set the effect is pronounced; with a large data set the partitions tend to be more balanced.
Data repartitioned from four partitions to three partitions
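The effect can be demonstrated with a small hypothetical Python sketch (illustrative only): each source partition deals its own rows out independently, always starting at the first target partition.

    from collections import Counter

    def repartition_counts(source_sizes, num_targets):
        # Each source partition round-robins its rows, always starting at target 0.
        counts = Counter()
        for size in source_sizes:
            for i in range(size):
                counts[i % num_targets] += 1
        return counts

    # Four perfectly balanced partitions of 4 rows each, repartitioned to three targets.
    print(repartition_counts([4, 4, 4, 4], 3))   # target 0 gets 8 rows, targets 1 and 2 get 4 each

With a much larger number of rows per source partition, the extra rows on target 0 become a negligible fraction of the total, which is why the imbalance matters mainly for very small data sets.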

The Mechanics of Partitioning and Collecting


This section gives a quick guide to how partitioning and collecting are represented in a DataStage job.


Partitioning Icons
Each parallel stage in a job can partition or repartition incoming data before it operates on it. Equally, it can simply accept the data in the partitions in which it arrives. There is an icon on the input link to a stage which shows how the stage handles partitioning. In most cases, if you just lay down a series of parallel stages in a DataStage job and join them together, the auto method will determine partitioning. This is shown on the canvas by the auto partitioning icon:

In some cases, stages have a specific partitioning method associated with them that cannot be overridden. It always uses this method to organize incoming data before it processes it. In this case an icon on the input link tells you that the stage is repartitioning data:

If you have a data link from a stage running sequentially to one running in parallel the following icon is shown to indicate that the data is being partitioned:

You can specify that you want to accept the existing data partitions by choosing a partitioning method of same. This is shown by the following icon on the input link:

Partitioning methods are set on the Partitioning tab of the Inputs pages on a stage editor (see page 3-20).

Preserve Partitioning Flag


A stage can also request that the next stage in the job preserves whatever partitioning it has implemented. It does this by setting the Preserve Partitioning flag for its output link. Note, however, that the next stage may ignore this request.

In most cases you are best leaving the Preserve Partitioning flag in its default state. The exception to this is where preserving existing partitioning is important. The flag will not prevent repartitioning, but it will warn you that it has happened when you run the job. If the Preserve Partitioning flag is cleared, this means that the current stage doesn't care what the next stage in the job does about partitioning.

On some stages, the Preserve Partitioning flag can be set to Propagate. In this case the stage sets the flag on its output link according to what the previous stage in the job has set. If the previous stage is also set to Propagate, the setting from the stage before is used, and so on, until a Set or Clear flag is encountered earlier in the job. If the stage has multiple inputs and has a flag set to Propagate, its Preserve Partitioning flag is set if it is set on any of the inputs, or cleared if all the inputs are clear.
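For a stage with several inputs, the Propagate rule reduces to a simple any/all test, sketched here in hypothetical Python (illustrative only):

    def propagated_preserve_flag(input_flags):
        # Set the output flag if any input has Preserve Partitioning set;
        # clear it only when every input is clear.
        return any(input_flags)

    print(propagated_preserve_flag([False, True, False]))   # True: one input is set
    print(propagated_preserve_flag([False, False]))         # False: all inputs are clear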

Collecting Icons
A stage in the job which is set to run sequentially will need to collect partitioned data before it operates on it. There is an icon on the input link to a stage which shows that it is collecting data:

Sorting Data
You will probably have requirements in your DataStage jobs to sort data. DataStage has a sort stage (see Chapter 23), which allows you to perform complex sorting operations. There are situations, however, where you require a fairly simple sort as a precursor to a processing operation. For these purposes, DataStage allows you to insert a sort operation in most stage types for incoming data. You do this by selecting the Sorting option on the Input page Partitioning tab (see "Partitioning Tab" on page 3-20). When you do this you can specify:

- Sorting keys. The field(s) on which data is sorted. You must specify a primary key, but you can also specify any number of secondary keys. The first key you define is taken as the primary.
- Stable sort (this is the default and specifies that previously sorted data sets are preserved).
- Unique sort (discards records if multiple records have identical sorting key values).
- Case sensitivity.
- Sort direction.
- Sorted as EBCDIC (ASCII is the default).

If you have NLS enabled, you can also specify the collate convention used.

Some DataStage operations require that the data they process is sorted (for example, the Merge operation). If DataStage detects that the input data set is not sorted in such a case, it will automatically insert a sort operation in order to enable the processing to take place unless you have explicitly specified otherwise.

Data Sets
Inside a DataStage parallel job, data is moved around in data sets. These carry meta data with them, both column definitions and information about the configuration that was in effect when the data set was created. If, for example, you have a stage which limits execution to a subset of available nodes, and the data set was created by a stage using all nodes, DataStage can detect that the data will need repartitioning.

If required, data sets can be landed as persistent data sets, represented by a Data Set stage (see Chapter 4, "Data Set Stage.") This is the most efficient way of moving data between linked jobs. Persistent data sets are stored in a series of files linked by a control file (note that you should not attempt to manipulate these files using UNIX tools such as rm or mv; always use the tools provided with DataStage).
Note The example screenshots in the individual stage descriptions often show the stage connected to a Data Set stage. This does not mean that these kinds of stage can only be connected to Data Set stages.

Meta Data
Meta data is information about data. It describes the data flowing through your job in terms of column definitions, which describe each of the fields making up a data record.


DataStage has two alternative ways of handling meta data: through table definitions, or through schema files. By default, parallel stages derive their meta data from the columns defined on the Outputs or Inputs page Columns tab of your stage editor. Additional formatting information is supplied, where needed, by a Formats tab on the Outputs or Inputs page. In some cases you can specify that the stage uses a schema file instead, by explicitly setting a property on the stage editor and specifying the name and location of the schema file. Note that, if you use a schema file, you should ensure that runtime column propagation is turned on. Otherwise the column definitions specified in the stage editor will always override any schema file.

Where is additional formatting information needed? Typically this is where you are reading from, or writing to, a file of some sort and DataStage needs to know more about how data in the file is formatted. You can specify formatting information on a row basis, where the information is applied to every column in every row in the data set. This is done from the Formats tab (the Formats tab is described with the stage editors that support it; for example, for Sequential files, see page 5-13). You can also specify formatting for particular columns (which overrides the row formatting) from the Edit Column Metadata dialog box for each column (see page 3-28).

Runtime Column Propagation


DataStage is also flexible about meta data. It can cope with the situation where meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP). This can be enabled for a project via the DataStage Administrator (see "Enable Runtime Column Propagation for Parallel Jobs" in DataStage Administrator Guide), and set for individual links via the Outputs Page Columns tab (see "Columns Tab" on page 3-51) for most stages, or in the Outputs page General tab for Transformer stages (see "Outputs Page" on page 16-34). You should always ensure that runtime column propagation is turned on if you want to use schema files to define column meta data.

Table Definitions
A table definition is a set of related column definitions that are stored in the DataStage Repository. These can be loaded into stages as and when required.

You can import a table definition from a data source via the DataStage Manager or Designer. You can also edit and define new table definitions in the Manager or Designer (see "Managing Table Definitions" in DataStage Manager Guide). If you want, you can edit individual column definitions once you have loaded them into your stage. You can also simply type in your own column definition from scratch on the Outputs or Inputs page Column tab of your stage editor (see page 3-26 and page 3-51). When you have entered a set of column definitions you can save them as a new table definition in the Repository for subsequent reuse in another job.

Schema Files and Partial Schemas


You can also specify the meta data for a stage in a plain text file known as a schema file. This is not stored in the DataStage Repository but you could, for example, keep it in a document management or source code control system, or publish it on an intranet site. The format of schema files is described in Appendix A of this manual.
Note If you are using a schema file on an NLS system, the schema file needs to be in UTF-8 format. It is, however, easy to convert text files between two different maps with a DataStage job. Such a job would read data from a text file using a Sequential File stage and specifying the appropriate character set on the NLS Map page. It would write the data to another file using a Sequential File stage, specifying the UTF-8 map on the NLS Map page.

Some parallel job stages allow you to use a partial schema. This means that you only need define column definitions for those columns that you are actually going to operate on. Partial schemas are also described in Appendix A. Remember that you should turn runtime column propagation on if you intend to use schema files to define column meta data.

Data Types
When you work with parallel job column definitions, you will see that they have an SQL type associated with them. This maps onto an underlying data type which you use when specifying a schema via a file, and which you can view in the Parallel tab of the Edit Column Meta Data dialog box (see page 3-26 for details). The underlying data type is what a parallel job data set understands. The following table summarizes the underlying data types that column definitions can have:
SQL Type | Underlying Data Type | Size | Description
Date | date | 4 bytes | Date with month, day, and year
Decimal, Numeric | decimal | (Roundup(p)+1)/2 | Packed decimal, compatible with IBM packed decimal format
Float, Real | sfloat | 4 bytes | IEEE single-precision (32-bit) floating point value
Double | dfloat | 8 bytes | IEEE double-precision (64-bit) floating point value
TinyInt | int8 or uint8 | 1 byte | Signed or unsigned integer of 8 bits (Extended (unsigned) option for unsigned)
SmallInt | int16 or uint16 | 2 bytes | Signed or unsigned integer of 16 bits (Extended (unsigned) option for unsigned)
Integer | int32 or uint32 | 4 bytes | Signed or unsigned integer of 32 bits (Extended (unsigned) option for unsigned)
BigInt (see note 1) | int64 or uint64 | 8 bytes | Signed or unsigned integer of 64 bits (Extended (unsigned) option for unsigned)
Binary, Bit, LongVarBinary, VarBinary | raw | 1 byte per character | Untyped collection, consisting of a fixed or variable number of contiguous bytes and an optional alignment value
Unknown, Char, LongVarChar, VarChar | string | 1 byte per character | ASCII character string of fixed or variable length (without the Extended (Unicode) option selected)
NChar, NVarChar, LongNVarChar | ustring | multiple bytes per character | ASCII character string of fixed or variable length
Char, LongVarChar, VarChar | ustring | multiple bytes per character | ASCII character string of fixed or variable length (with the Extended (Unicode) option selected)
Char | subrec | sum of lengths of subrecord fields | Complex data type comprising nested columns
Char | tagged | sum of lengths of subrecord fields | Complex data type comprising tagged columns, of which one can be referenced when the column is used
Time | time | 5 bytes | Time of day, with resolution of seconds
Time | time(microseconds) | 5 bytes | Time of day, with resolution of microseconds (Extended (Microseconds) option selected)
Timestamp | timestamp | 9 bytes | Single field containing both date and time value
Timestamp | timestamp(microseconds) | 9 bytes | Single field containing both date and time value, with resolution of microseconds (Extended (Microseconds) option selected)

Note 1: BigInt values map to long long integers on all supported platforms except Tru64, where they map to long integers. For all platforms except Tru64, the c_format is: '%[padding_character][integer]lld'. Because Tru64 supports real 64-bit integers, its c_format is: '%[padding_character][integer]ld'. The integer component specifies a minimum field width. The output column is printed at least this wide, and wider if necessary. If the column has fewer digits than the field width, it is padded on the left with padding_character to make up the field width. The default padding character is a space. For this example c_format specification: '%09lld', the padding character is zero (0), and the integers 123456 and 123456789 are printed out as 000123456 and 123456789.

When you work with mainframe data using the CFF stage, the data types are as follows:
COBOL Data Type | Size | COBOL Picture/Usage | Underlying Data Type
binary, native binary | 2 bytes | S9(1-4) COMP/COMP-5 | int16
binary, native binary | 4 bytes | S9(5-9) COMP/COMP-5 | int32
binary, native binary | 8 bytes | S9(10-18) COMP/COMP-5 | int64
binary, native binary | 2 bytes | 9(1-4) COMP/COMP-5 | uint16
binary, native binary | 4 bytes | 9(5-9) COMP/COMP-5 | uint32
binary, native binary | 8 bytes | 9(10-18) COMP/COMP-5 | uint64
character | n bytes | X(n) | string[n]
character for filler | n bytes | X(n) | raw(n)
varchar | n bytes | X(n) | string[max=n]
decimal (packed) | (x+y)/2+1 bytes | 9(x)V9(y) COMP-3 | decimal[x+y,y]
decimal (packed) | (x+y)/2+1 bytes | S9(x)V9(y) COMP-3 | decimal[x+y,y]
display_numeric (zoned) | x+y bytes | 9(x)V9(y) | decimal[x+y,y] or string[x+y]
display_numeric (zoned, trailing) | x+y bytes | S9(x)V9(y) | decimal[x+y,y] or string[x+y]
display_numeric (zoned, trailing) | x+y bytes | S9(x)V9(y) SIGN IS TRAILING | decimal[x+y,y]
display_numeric (zoned, leading) | x+y bytes | S9(x)V9(y) SIGN IS LEADING | decimal[x+y,y]
display_numeric (separate, trailing) | x+y+1 bytes | S9(x)V9(y) SIGN IS TRAILING SEPARATE | decimal[x+y,y]
display_numeric (separate, leading) | x+y+1 bytes | S9(x)V9(y) SIGN IS LEADING SEPARATE | decimal[x+y,y]
float (floating point) | 4 bytes | COMP-1 | sfloat
float (floating point) | 8 bytes | COMP-2 | dfloat
graphic_n, graphic_g | n*2 bytes | N(n) or G(n) DISPLAY-1 | ustring[n]
vargraphic_g/n | n*2 bytes | N(n) or G(n) DISPLAY-1 | ustring[max=n]
group | | | subrec

Strings and Ustrings


If you have NLS enabled, parallel jobs support two types of underlying character data types: strings and ustrings. String data represents unmapped bytes, ustring data represents full Unicode (UTF-16) data.

The Char, VarChar, and LongVarChar SQL types relate to underlying string types where each character is 8 bits and does not require mapping because it represents an ASCII character. You can, however, specify that these data types are extended, in which case they are taken as ustrings and do require mapping. (They are specified as such by selecting the Extended check box for the column in the Edit Meta Data dialog box.) An Extended field appears in the columns grid, and extended Char, VarChar, or LongVarChar columns have Unicode in this field. The NChar, NVarChar, and LongNVarChar types relate to underlying ustring types so do not need to be explicitly extended.

Complex Data Types


Parallel jobs support three complex data types:

- Subrecords
- Tagged subrecords
- Vectors

When referring to complex data in DataStage column definitions, you can specify fully qualified column names, for example:
Parent.Child5.Grandchild2

Subrecords
A subrecord is a nested data structure. The column with type subrecord does not itself define any storage, but the columns it contains do. These columns can have any data type, and you can nest subrecords one within another. The LEVEL property is used to specify the structure of subrecords. The following diagram gives an example of a subrecord structure.
Parent (subrecord)
    LEVEL 01:
        Child1 (string)
        Child2 (string)
        Child3 (integer)
        Child4 (date)
        Child5 (subrecord)
            LEVEL 02:
                Grandchild1 (string)
                Grandchild2 (time)
                Grandchild3 (sfloat)

Tagged Subrecord
This is a special type of subrecord structure: it comprises a number of columns of different types, and the actual column is ONE of these, as indicated by the value of a tag at run time. The columns can be of any type except subrecord or tagged. The following diagram illustrates a tagged subrecord.
Parent (tagged)
    Child1 (string)
    Child2 (int8)
    Child3 (raw)
Tag = Child1, so the column has a data type of string

Vector
A vector is a one dimensional array of any type except tagged. All the elements of a vector are of the same type, and are numbered from 0. The vector can be of fixed or variable length. For fixed length vectors the length is explicitly stated, for variable length ones a property defines a link field which gives the length at run time. The following diagram illustrates a vector of fixed length and one of variable length.
Fixed-length vector: elements int32[0] through int32[8].
Variable-length vector: elements int32[0] through int32[N], where the link field supplies the length N at run time.

Incorporating Server Job Functionality


You can incorporate Server job functionality in your Parallel jobs by the use of Server Shared Container stages. This allows you to, for example, use Server job plug-in stages to access data sources that are not directly supported by Parallel jobs. (Some plug-ins have parallel versions that you can use directly in a parallel job.) You create a new shared container in the DataStage Designer, add Server job stages as required, and then add the Server Shared Container to your Parallel job and connect it to the Parallel stages. Server Shared Container stages used in Parallel jobs have extra pages in their Properties dialog box, which enable you to specify details about parallel processing and partitioning and collecting data.


You can only use Server Shared Containers in this way on SMP systems (not MPP or cluster systems). The following limitations apply to the contents of such Server Shared Containers:

- There must be zero or one container inputs, zero or more container outputs, and at least one of either.
- There can be no disconnected flows: all stages must be linked to the input or an output of the container directly or via an active stage. When the container has an input and one or more outputs, each stage must connect to the input and at least one of the outputs.
- There can be no synchronization by having a passive stage with both input and output links.

For details on how to use Server Shared Containers, see "Containers" in DataStage Designer Guide. This also tells you how to use Parallel Shared Containers, which enable you to package parallel job functionality in a reusable form.


3
Stage Editors
The Parallel job stage editors all use a generic user interface (with the exception of the Transformer stage, Shared Container, and Complex Flat File stages). This chapter describes the generic editor and gives a guide to using it.

Parallel jobs have a large number of stages available. They are organized into groups in the tool palette, or you can drag all the stages you use frequently to the Favorites category.

The stage editors are divided into the following basic types:

- Database. These are stages that read or write data contained in a database. Examples of database stages are the Oracle Enterprise and DB2/UDB Enterprise stages.
- Development/Debug. These are stages that help you when you are developing and troubleshooting parallel jobs. Examples are the Peek and Row Generator stages.
- File. These are stages that read or write data contained in a file or set of files. Examples of file stages are the Sequential File and Data Set stages.
- Processing. These are stages that perform some processing on the data that is passing through them. Examples of processing stages are the Aggregator and Transformer stages.
- Real Time. These are the stages that allow Parallel jobs to be made available as RTI services. They comprise the RTI Source and RTI Target stages. These are part of the optional Web Services package.
- Restructure. These are stages that deal with and manipulate data containing columns of complex data type. Examples are Make Subrecord and Make Vector stages.


Parallel jobs also support local containers and shared containers. Local containers allow you to tidy your designs by putting portions of functionality in a container, the contents of which are viewed on a separate canvas. Shared containers are similar, but are stored separately in the repository and can be reused by other parallel jobs. Parallel jobs can use both Parallel Shared Containers and Server Shared Containers. Using shared containers is described in DataStage Designer Guide. The following table lists the available stage types and gives a quick guide to their function:
Stage | Type | Function
Data Set (Chapter 4) | File | Allows you to read data from or write data to a persistent data set.
Sequential File (Chapter 5) | File | Allows you to read data from or write data to one or more flat files.
File Set (Chapter 6) | File | Allows you to read data from or write data to a file set. File sets enable you to spread data across a set of files referenced by a single control file.
Lookup File Set (Chapter 7) | File | Allows you to create a lookup file set or reference one for a lookup.
External Source (Chapter 8) | File | Allows you to read data that is output from one or more source programs.
External Target (Chapter 9) | File | Allows you to write data to one or more source programs.
Complex Flat File (Chapter 10) | File | Allows you to read or write complex flat files on a mainframe machine. This is intended for use on USS systems (note that it uses a different interface from other file stages).
SAS Data Set (Chapter 11) | File | Allows you to read data from or write data to a parallel SAS data set in conjunction with an SAS stage.
DB2/UDB Enterprise (Chapter 12) | Database | Allows you to read data from and write data to a DB2 database.
Oracle Enterprise (Chapter 13) | Database | Allows you to read data from and write data to an Oracle database.
Teradata Enterprise (Chapter 14) | Database | Allows you to read data from and write data to a Teradata database.
Informix Enterprise (Chapter 15) | Database | Allows you to read data from and write data to an Informix database.
Transformer (Chapter 16) | Processing | Handles extracted data, performs any conversions required, and passes data to another active stage or a stage that writes data to a target database or file.
BASIC Transformer (Chapter 17) | Processing | Same as Transformer stage, but gives access to DataStage BASIC functions.
Aggregator (Chapter 18) | Processing | Classifies incoming data into groups, computes totals and other summary functions for each group, and passes them to another stage in the job.
Join (Chapter 19) | Processing | Performs join operations on two or more data sets input to the stage and then outputs the resulting data set.
Merge (Chapter 20) | Processing | Combines a sorted master data set with one or more sorted update data sets.
Lookup (Chapter 21) | Processing | Used to perform lookup operations on a data set read into memory from any other Parallel job stage that can output data, or provided by one of the database stages that support reference output links. It can also perform a lookup on a lookup table contained in a Lookup File Set stage.
Funnel (Chapter 22) | Processing | Copies multiple input data sets to a single output data set.
Sort (Chapter 23) | Processing | Sorts input columns.
Remove Duplicates (Chapter 24) | Processing | Takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set.
Compress (Chapter 25) | Processing | Uses the UNIX compress or GZIP utility to compress a data set. It converts a data set from a sequence of records into a stream of raw binary data.
Expand (Chapter 26) | Processing | Uses the UNIX uncompress or GZIP utility to expand a data set. It converts a previously compressed data set back into a sequence of records from a stream of raw binary data.
Copy (Chapter 27) | Processing | Copies a single input data set to a number of output data sets.
Modify (Chapter 28) | Processing | Alters the record schema of its input data set.
Filter (Chapter 29) | Processing | Transfers, unmodified, the records of the input data set which satisfy requirements that you specify, and filters out all other records.
External Filter (Chapter 30) | Processing | Allows you to specify a UNIX command that acts as a filter on the data you are processing.
Change Capture (Chapter 31) | Processing | Takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set.
Change Apply (Chapter 32) | Processing | Takes the change data set, that contains the changes in the before and after data sets, from the Change Capture stage and applies the encoded change operations to a before data set to compute an after data set.
Difference (Chapter 33) | Processing | Performs a record-by-record comparison of two input data sets, which are different versions of the same data set.
Compare (Chapter 34) | Processing | Performs a column-by-column comparison of records in two presorted input data sets.
Encode (Chapter 35) | Processing | Encodes a data set using a UNIX encoding command that you supply.
Decode (Chapter 36) | Processing | Decodes a data set using a UNIX decoding command that you supply.
Switch (Chapter 37) | Processing | Takes a single data set as input and assigns each input record to an output data set based on the value of a selector field.
SAS (Chapter 38) | Processing | Allows you to execute part or all of an SAS application in parallel.
Generic (Chapter 39) | Processing | Lets you incorporate an Orchestrate Operator in your job.
Surrogate Key (Chapter 40) | Processing | Generates one or more surrogate key columns and adds them to an existing data set.
Column Import (Chapter 41) | Restructure | Imports data from a single column and outputs it to one or more columns.
Column Export (Chapter 42) | Restructure | Exports data from a number of columns of different data types into a single column of data type string or binary.
Make Subrecord (Chapter 43) | Restructure | Combines specified vectors in an input data set into a vector of subrecords whose columns have the names and data types of the original vectors.
Split Subrecord (Chapter 44) | Restructure | Creates one new vector column for each element of the original subrecord.
Combine Records (Chapter 45) | Restructure | Combines records, in which particular key-column values are identical, into vectors of subrecords.
Promote Subrecord (Chapter 46) | Restructure | Promotes the columns of an input subrecord to top-level columns.
Make Vector (Chapter 47) | Restructure | Combines specified columns of an input data record into a vector of columns of the same type.
Split Vector (Chapter 48) | Restructure | Promotes the elements of a fixed-length vector to a set of similarly named top-level columns.
Head (Chapter 49) | Development/Debug | Selects the first N records from each partition of an input data set and copies the selected records to an output data set.
Tail (Chapter 50) | Development/Debug | Selects the last N records from each partition of an input data set and copies the selected records to an output data set.
Sample (Chapter 51) | Development/Debug | Samples an input data set.
Peek (Chapter 52) | Development/Debug | Lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets.
Row Generator (Chapter 53) | Development/Debug | Produces a set of mock data fitting the specified meta data.
Column Generator (Chapter 54) | Development/Debug | Adds columns to incoming data and generates mock data for these columns for each data row processed.
Write Range Map (Chapter 55) | Development/Debug | Allows you to write data to a range map. The stage can have a single input link.

Stage Editors

Showing Stage Validation Errors

All of the stage types use the same basic stage editor, but the pages that actually appear when you edit the stage depend on the exact type of stage you are editing. The following sections describe all the page types and sub tabs that are available. The individual descriptions of stage editors in the following chapters tell you exactly which features of the generic editor each stage type uses.

Showing Stage Validation Errors


If you enable the Show stage validation errors option in the Diagram menu (or toolbar), the DataStage Designer will give you visual cues for parallel jobs or parallel shared containers. The visual cues display compilation errors for every stage on the canvas, without you having to actually compile the job. The option is enabled by default. Here is an example of a parallel job showing visual cues:

The top Oracle stage has a warning triangle, showing that there is a compilation error. If you hover the mouse pointer over the stage a tooltip appears, showing the particular errors for that stage. Any local containers on your canvas will behave like a stage, i.e., all the compile errors for stages within the container are displayed. You have to open a parallel shared container in order to see any compile problems on the individual stages.
Note Parallel transformer stages will only show certain errors; to detect C++ errors in the stage, you have to actually compile the job containing it.


The Stage Page


All stage editors have a Stage page. This contains a number of subsidiary tabs depending on the stage type. The only field the Stage page itself contains gives the name of the stage being edited.

General Tab
All stage editors have a General tab. This allows you to enter an optional description of the stage. Specifying a description here enhances job maintainability.

Properties Tab
A Properties tab appears on the Stage page where there are general properties that need setting for the particular stage you are editing. Properties tabs can also occur under Input and Output pages where there are link-specific properties that need to be set.


The properties for most general stages are set under the Stage page.


The available properties are displayed in a tree structure. They are divided into categories to help you find your way around them. All the mandatory properties are included in the tree by default and cannot be removed.

Properties that you must set a value for (i.e., which have not got a default value) are shown in the warning color (red by default), but change to black when you have set a value. You can change the warning color by opening the Options dialog box (select Tools > Options from the DataStage Designer main menu) and choosing the Transformer item from the tree. Reset the Invalid column color by clicking on the color bar and choosing a new color from the palette.

To set a property, select it in the list and specify the required property value in the property value field. The title of this field and the method for entering a value changes according to the property you have selected. In the example above, the Key property is selected, so the Property Value field is called Key and you set its value by choosing one of the available input columns from a drop-down list. Key is shown in red because you must select a key for the stage to work properly. The Information field contains details about the property you currently have selected in the tree.

Where you can browse for a property value, or insert a job parameter whose value is provided at run time, a right arrow appears next to the field. Click on this and a menu gives access to the Browse Files dialog box and/or a list of available job parameters (job parameters are defined in the Job Properties dialog box - see "Job Properties" in DataStage Designer Guide).

Some properties have default values, and you can always return to the default by selecting it in the tree and choosing Set to default from the shortcut menu.

Some properties are optional. These appear in the Available properties to add field. Click on an optional property to add it to the tree or choose to add it from the shortcut menu. You can remove it again by selecting it in the tree and selecting Remove from the shortcut menu.

Some properties can be repeated. In the example above you can add multiple key properties. The Key property appears in the Available properties to add list when you select the tree top level Properties node. Click on the Key item to add multiple key properties to the tree.

Where a repeatable property expects a column as an argument, a dialog is available that lets you specify multiple columns at once. To open this, click the column button next to the properties tree.


The Column Selection dialog box opens. The left pane lists all the available columns; use the right arrow keys to select some or all of them (use the left arrow keys to move them back if you change your mind). A separate property will appear for each column you have selected.

Some properties have dependents. These are properties which somehow relate to or modify the parent property. They appear under the parent in a tree structure.

For some properties you can supply a job parameter as their value. At run time the value of this parameter will be used for the property. Such properties will have an arrow next to their Property Value box. Click the arrow to get a drop-down menu, then choose Insert job parameter to get a list of currently defined job parameters to choose from (see "Specifying Job Parameters" in DataStage Designer Guide for information about job parameters).

You can switch to a multiline editor for entering property values for some properties. Do this by clicking on the arrow next to their Property Value box and choosing Switch to multiline editor from the menu.

The property capabilities are indicated by different icons in the tree as follows:
- non-repeating property with no dependents
- non-repeating property with dependents
- repeating property with no dependents
- repeating property with dependents

The properties for individual stage types are described in the chapter about the stage.

Advanced Tab
All stage editors have an Advanced tab. This allows you to:

- Specify the execution mode of the stage. This allows you to choose between Parallel and Sequential operation. If the execution mode for a particular type of stage cannot be changed, then this drop-down list is disabled. Selecting Sequential operation forces the stage to be executed on a single node. If you have intermixed sequential and parallel stages this has implications for partitioning and collecting data between the stages. You can also let DataStage decide by choosing the default setting for the stage (the drop-down list tells you whether this is parallel or sequential).

- Set or clear the preserve partitioning flag (this field is not available for all stage types). It indicates whether the stage wants to preserve partitioning at the next stage of the job (see "Preserve Partitioning Flag" on page 2-24). You choose between Set, Clear and Propagate. For some stage types, Propagate is not available. The operation of each option is as follows:

  - Set. Sets the preserve partitioning flag; this indicates to the next stage in the job that it should preserve existing partitioning if possible.
  - Clear. Clears the preserve partitioning flag. Indicates that this stage does not care which partitioning method the next stage uses.
  - Propagate. Sets the flag to Set or Clear depending on what the previous stage in the job has set (or if that is set to Propagate, the stage before that, and so on until a preserve partitioning flag setting is encountered).

  You can also let DataStage decide by choosing the default setting for the stage (the drop-down list tells you whether this is set, clear, or propagate).

- Specify the combinability mode. Under the covers DataStage can combine the operators that underlie parallel stages so that they run in the same process. This saves a significant amount of data copying and preparation in passing data between operators. The combinability mode setting tells DataStage your preferences for combining for a particular stage. It has three possible settings:

  - Auto. Use the default combination setting.
  - Combinable. Ignore the operator's default setting and combine if at all possible (some operators are marked as noncombinable by default).
  - Don't Combine. Never combine operators.

  In most cases the setting should be left to Auto.

- Specify node map or node pool or resource pool constraints. The configuration file allows you to set up pools of related nodes or resources (see "The Configuration File" on page 2-6). The Advanced tab allows you to limit execution of a stage to a particular node or resource pool. You can also use a map to specify a group of nodes that execution will be limited to just in this stage. Supply details as follows:

  - Node pool and resource constraints. Specify constraints in the grid. Select Node pool or Resource pool from the Constraint drop-down list. Select a Type for a resource pool and, finally, select the name of the pool you are limiting execution to. You can select multiple node or resource pools. This is only enabled if you have defined multiple pools in the configuration file.
  - Node map constraints. Select the option box and type in the nodes to which execution will be limited in the text box. You can also browse through the available nodes to add to the text box. Using this feature conceptually sets up an additional node pool which doesn't appear in the configuration file.

The lists of available nodes, available node pools, and available resource pools are derived from the configuration file.

Link Ordering Tab


This tab allows you to order the links for stages that have more than one link and where ordering of the links is required.


The tab allows you to order input links and/or output links as needed. Where link ordering is not important or is not possible the tab does not appear.

The link label gives further information about the links being ordered. In the example we are looking at the Link Ordering tab for a Join stage. The join operates in terms of having a left link and a right link, and this tab tells you which actual link the stage regards as being left and which right. If you use the arrow keys to change the link order, the link name changes but not the link label. In our example, if you pressed the down arrow button, DSLink27 would become the left link, and DSLink26 the right. A Join stage can only have one output link, so in the example the Order the following output links section is disabled.

The following example shows the Link Ordering tab from a Merge stage. In this case you can order both input links and output links. The Merge stage handles reject links as well as a stream link and the tab allows you to order these, although you cannot move them to the stream link position. Again the link labels give the sense of how the links are being used.

The individual stage descriptions tell you whether link ordering is possible and what options are available.

NLS Map Tab


If you have NLS enabled on your system, some of your stages will have an NLS Map tab. This allows you to override the project default character set map for this stage, and in some cases, allows you to enable per-column mapping. When per-column mapping is enabled, you can override the character set map for particular columns (an NLS map field appears on the columns tab allowing you to do this).


Select a map from the list, or click the arrow button next to the list to specify a job parameter.

The following stage types currently support this feature:
- Sequential File
- File Set
- Lookup File Set
- External Source
- External Target
- DB2/UDB Enterprise (not per-column mapping)
- Oracle Enterprise (not per-column mapping)

NLS Locale Tab


If you have NLS enabled on your system, some of your stages will have an NLS Locale tab. It lets you view the current default collate convention, and select a different one for the stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated; for example, the character Ä follows A in Germany, but follows Z in Sweden.

Parallel Job Developers Guide

3-17

Inputs Page

Stage Editors

Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

The following types of stage have an NLS Locale tab:
- Stages that evaluate expressions, such as the Transformer.
- Stages that need to evaluate the order of key columns.
- The Sort Stage.

Inputs Page
The Inputs page gives information about links going into a stage. In the case of a file or database stage an input link carries data being written to the file or database. In the case of a processing or restructure stage it carries data that the stage will process before outputting to another stage. Where there are no input links, the stage editor has no Inputs page.

Where it is present, the Inputs page contains various tabs depending on stage type. The only field the Inputs page itself contains is Input name, which gives the name of the link being edited. Where a stage has more than one input link, you can select the link you are editing from the Input name drop-down list.

The Inputs page also has a Columns button. Click this to open a window showing column names from the meta data defined for this link. You can drag these columns to various fields in the Inputs page tabs as required.


Certain stage types will also have a View Data button. Press this to view the actual data associated with the specified data source or data target. The button is available if you have defined meta data for the link. Note the interface allowing you to view the file will be slightly different depending on stage and link type.

General Tab
The Inputs page always has a General tab. This allows you to enter an optional description of the link. Specifying a description for each link enhances job maintainability.

Properties Tab
Some types of file and database stages can have properties that are particular to specific input links. In this case the Inputs page has a Properties tab. This has the same format as the Stage page Properties tab (see "Properties Tab" on page 3-8).

Partitioning Tab
Most parallel stages have a default partitioning or collecting method associated with them. This is used depending on the execution mode of the stage (i.e., parallel or sequential) and the execution mode of the immediately preceding stage in the job. For example, if the preceding stage is processing data sequentially and the current stage is processing in parallel, the data will be partitioned before it enters the current stage. Conversely, if the preceding stage is processing data in parallel and the current stage is sequential, the data will be collected as it enters the current stage.

You can, if required, override the default partitioning or collecting method on the Partitioning tab. The selected method is applied to the incoming data as it enters the stage on a particular link, and so the Partitioning tab appears on the Inputs page. You can also use the tab to repartition data between two parallel stages. If both stages are executing sequentially, you cannot select a partition or collection method and the fields are disabled. The fields are also disabled if the particular stage does not permit selection of partitioning or collection methods. The following table shows what can be set from the Partitioning tab in what circumstances:

Preceding Stage    Current Stage    Partition Tab Mode
Parallel           Parallel         Partition
Parallel           Sequential       Collect
Sequential         Parallel         Partition
Sequential         Sequential       None (disabled)

The Partitioning tab also allows you to specify that the data should be sorted as it enters.

The Partitioning tab has the following fields:

Partition type. Choose the partitioning (or collecting) type from the drop-down list. The following partitioning types are available:

- (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for many stages.
- Entire. Every processing node receives the entire data set. No further information is required.
- Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
- Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. (An illustrative sketch of the Hash and Modulus methods is given after the Sorting options below.)
- Random. The records are partitioned randomly, based on the output of a random number generator. No further information is required.
- Round Robin. The records are partitioned on a round robin basis as they enter the stage. No further information is required.
- Same. Preserves the partitioning already in place. No further information is required.
- DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
- Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following collection types are available:

- (Auto). Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. This is the fastest collecting method and is the default collection method for many stages. In some circumstances DataStage will detect further requirements for collected data, for example, it might need to be sorted. Using Auto mode will ensure data is sorted if required.
- Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Requires no further information.
- Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
- Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

Available. This lists the input columns for the input link. Key columns are identified by a key icon. For partitioning or collecting methods that require you to select columns, you click on the required column in the list and it appears in the Selected list to the right. This list is also used to select columns to sort on.

Selected. This list shows which columns have been selected for partitioning on, collecting on, or sorting on and displays information about them. The available information is whether a sort is being performed (indicated by an arrow), if so the order of the sort (ascending or descending) and collating sequence (sort as EBCDIC), and whether an alphanumeric key is case sensitive or not. Nullable columns are marked to indicate if null columns take first or last position. You can select sort order, case sensitivity, collating sequence, and nulls position from the shortcut menu. If applicable, the Usage field indicates whether a particular key column is being used for sorting, partitioning, or both.

Sorting. The check boxes in the section allow you to specify sort details. The availability of sorting depends on the partitioning method chosen.

- Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
- Stable. Select this if you want to preserve previously sorted data sets. The default is stable.
- Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, whether sorted as EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu. The availability of the sort options depends on the type of data in the column, whether it is nullable or not, and the partitioning method chosen.

If you have NLS enabled, the sorting box has an additional button. Click this to open the NLS Locales tab of the Sort Properties dialog box. This lets you view the current default collate convention, and select a different one for the stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated; for example, the character Ä follows A in Germany, but follows Z in Sweden. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

If you require a more complex sort operation, you should use the Sort stage (see Chapter 21).
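As a purely illustrative sketch of how the key-based partitioning methods described in the list above behave, the following C fragment assigns records to partitions with a modulus function on an integer key and with a simple hash on a string key. The hash function and the four-partition count are assumptions made for the example; this is not DataStage's internal implementation.

```c
#include <stdio.h>

/* Illustrative only: assign a record to one of num_parts partitions.   */

/* Modulus partitioning: use an integer (tag) key directly.             */
static int modulus_partition(long key, int num_parts)
{
    return (int)(key % num_parts);
}

/* Hash partitioning: reduce a string key to a number first (toy hash). */
static int hash_partition(const char *key, int num_parts)
{
    unsigned long h = 5381;
    for (; *key != '\0'; key++)
        h = h * 33 + (unsigned char)*key;
    return (int)(h % num_parts);
}

int main(void)
{
    int num_parts = 4;   /* e.g. a four-node configuration */
    printf("tag 17 -> partition %d\n", modulus_partition(17, num_parts));
    printf("key \"SMITH\" -> partition %d\n", hash_partition("SMITH", num_parts));
    return 0;
}
```

Either way, records with the same key value always land in the same partition, which is why key-based sorting and joining rely on these methods.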


DB2 Partition Properties


This dialog box appears when you select a Partition type of DB2 and click the properties button. It allows you to specify the DB2 table whose partitioning method is to be replicated.

Range Partition Properties


This dialog box appears when you select a Partition type of Range and click the properties button. It allows you to specify the range map that is to be used to determine the partitioning (you create a range map file using the Write Range Map stage - see Chapter 55). Type in a pathname or browse for a file.


Format Tab
Stages that write to certain types of file (e.g., the Sequential File stage) also have a Format tab which allows you to specify the format of the file or files being written to.

The Format tab is similar in structure to the Properties tab. A flat file has a number of properties for which you can set different attributes. Select the property in the tree and select the attributes you want to set from the Available properties to add window; it will then appear as a dependent property in the property tree and you can set its value as required.

This tab sets the format information for the file at row level. You can override the settings for individual columns using the Edit Column Metadata dialog box (see page 3-28). If you click the Load button you can load the format information from a table definition in the Repository.

The shortcut menu from the property tree gives access to the following functions:

Format as. This applies a predefined template of properties. Choose from the following:
- Delimited/quoted
- Fixed-width records
- UNIX line terminator
- DOS line terminator
- No terminator (fixed width)
- Mainframe (COBOL)


Add sub-property. Gives access to a list of dependent properties for the currently selected property (visible only if the property has dependents).

Set to default. Appears if the currently selected property has been set to a non-default value, allowing you to re-select the default.

Remove. Removes the currently selected property. This is disabled if the current property is mandatory.

Remove all. Removes all the non-mandatory properties.

Details of the properties you can set are given in the chapter describing the individual stage editors:
- Sequential File stage: page 5-13
- File Set stage: page 6-10
- External Target stage: page 9-8
- Column Export stage: page 42-10

Columns Tab
The Inputs page always has a Columns tab. This displays the column meta data for the selected input link in a grid.

There are various ways of populating the grid:

- If the other end of the link has meta data specified for it, this will be displayed in the Columns tab (meta data is associated with, and travels with, a link).
- You can type the required meta data into the grid. When you have done this, you can click the Save button to save the meta data as a table definition in the Repository for subsequent reuse.
- You can load an existing table definition from the Repository. Click the Load button to be offered a choice of table definitions to load. Note that when you load in this way you bring in the column definitions, not any formatting information associated with them (to load that, go to the Format tab).
- You can drag a table definition from the Repository Window on the Designer onto a link on the canvas. This transfers both the column definitions and the associated format information.

If you select the options in the Grid Properties dialog box (see "Grid Properties" in DataStage Designer Guide), the Columns tab will also display two extra fields: Table Definition Reference and Column Definition Reference. These show the table definition and individual columns that the columns on the tab were derived from.

If you click in a row and select Edit Row from the shortcut menu, the Edit Column Meta Data dialog box appears, which allows you to edit the row details in a dialog box format. It also has a Parallel tab which allows you to specify properties that are peculiar to parallel job column definitions. The dialog box only shows those properties that are relevant for the current link.

The Parallel tab enables you to specify properties that give more detail about each column, and properties that are specific to the data type. Where you are specifying complex data types, you can specify a level number, which causes the Level Number field to appear in the grid on the Columns page.

If you have NLS enabled, and the column has an underlying string type, you can specify that the column contains Unicode data by selecting the Extended (Unicode) check box. Where you can enter a character for any property, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled).

Some table definitions need format information. This occurs where data is being written to a file where DataStage needs additional information in order to be able to locate columns and rows. Properties for the table definition at row level are set on the Format tab of the relevant stage editor, but you can override the settings for individual columns using the Parallel tab. The settings are made in a properties tree under the following categories:

Field Level
This has the following properties:

Bytes to Skip. Skip the specified number of bytes from the end of the previous column to the beginning of this column.

Delimiter. Specifies the trailing delimiter of the column. Type an ASCII character or select one of whitespace, end, none, null, comma, or tab.
- whitespace. The last column of each record will not include any trailing white spaces found at the end of the record.
- end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the same as a setting of None which is used for fields with fixed-width columns.
- none. No delimiter (used for fixed-width).
- null. ASCII Null character is used.
- comma. ASCII comma character used.
- tab. ASCII tab character used.

Delimiter string. Specify a string to be written at the end of the column. Enter one or more characters. This is mutually exclusive with Delimiter, which is the default. For example, specifying ", " (comma space; you do not need to enter the inverted commas) would have the column delimited by ", ".

Drop on input. Select this property when you must fully define the meta data for a data set, but do not want the column actually read into the data set.


Prefix bytes. Specifies that this column is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column's length or the tag value for a tagged column. You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. DataStage inserts the prefix before each field. This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which are used by default. (A sketch of a length-prefixed field follows this list.)

Print field. This property is intended for use when debugging jobs. Set it to have DataStage produce a message for each of the columns it reads. The message has the format:

Importing N: D

where:

N is the column name. D is the imported data of the column. Non-printable characters contained in D are prefixed with an escape character and written as C string literals; if the column contains binary data, it is output in octal format.

Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter a character.

Start position. Specifies the starting position of a column in the record. The starting position can be either an absolute byte offset from the first record position (0) or the starting position of another column.

Tag case value. Explicitly specifies the tag value corresponding to a subfield in a tagged subrecord. By default the fields are numbered 0 to N-1, where N is the number of fields. (A tagged subrecord is a column whose type can vary. The subfields of the tagged subrecord are the possible types. The tag case value of the tagged subrecord selects which of those types is used to interpret the column's value for the record.)
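The effect of the Prefix bytes property on a variable-length field can be pictured with the following C sketch, which writes a string field preceded by a 2-byte binary length instead of a trailing delimiter. The output file name and the little-endian prefix are assumptions for the illustration; a real column follows the prefix size and Byte order set in its properties.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Write one variable-length field as: 2-byte length prefix + field bytes.  */
/* A little-endian prefix is assumed here purely for the sketch.            */
static void write_prefixed_field(FILE *out, const char *value)
{
    uint16_t len = (uint16_t)strlen(value);
    unsigned char prefix[2];
    prefix[0] = (unsigned char)(len & 0xFF);
    prefix[1] = (unsigned char)(len >> 8);
    fwrite(prefix, 1, 2, out);
    fwrite(value, 1, len, out);
}

int main(void)
{
    FILE *out = fopen("field.dat", "wb");  /* example file name */
    if (out == NULL)
        return 1;
    write_prefixed_field(out, "SMITH");    /* written as 05 00 'S' 'M' 'I' 'T' 'H' */
    fclose(out);
    return 0;
}
```

The reader then uses the prefix, rather than a delimiter character, to know how many bytes belong to the field.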

String Type
This has the following properties: Character Set. Choose from ASCII or EBCDIC (not available for ustring type (Unicode)). Default. The default value for a column. This is used for data written by a Generate stage. It also supplies the value to substitute for a column that causes an error (whether written or read).


Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters (not available for ustring type (Unicode)).

Is link field. Selected to indicate that a column holds the length of another, variable-length column of the record or of the tag value of a tagged record field.

Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters (not available for ustring type (Unicode)).

Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

Field width. The number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it's a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

Pad char. Specifies the pad character used when strings or numeric values are written to an external string representation. Enter a character (single-byte for strings, can be multi-byte for ustrings) or choose null or space. The pad character is used when the external string representation is larger than required to hold the written field. In this case, the external string is filled with the pad character to its full length. Space is the default. Applies to string, ustring, and numeric data types and record, subrec, or tagged types if they contain at least one field of this type.

Date Type
Byte order. Specifies how multiple byte data types are ordered. Choose from:

little-endian. The high byte is on the right. big-endian. The high byte is on the left. native-endian. As defined by the native format of the machine.

Character Set. Choose from ASCII or EBCDIC.


Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see DataStage NLS Guide). Data Format. Specifies the data representation format of a column. Choose from:

binary text

For dates, binary is equivalent to specifying the julian property for the date field, text specifies that the data to be written contains a text-based date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see DataStage NLS Guide). Default. The default value for a column. This is used for data written by a Generate stage. It also supplies the value to substitute for a column that causes an error (whether written or read). Format string. The string format of a date. By default this is %yyyy-%mm-%dd. The Format string can contain one or a combination of the following elements:

- %dd: A two-digit day.
- %mm: A two-digit month.
- %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff, for example %1970yy.
- %yy: A two-digit year derived from a year cutoff of 1900.
- %yyyy: A four-digit year.
- %ddd: Day of year in three-digit form (range of 1 - 366).
- %mmm: Three-character month abbreviation.

The format_string is subject to the following restrictions:


- It cannot have more than one element of the same type, for example it cannot contain two %dd elements.
- It cannot have both %dd and %ddd.
- It cannot have both %yy and %yyyy.
- It cannot have both %mm and %ddd.
- It cannot have both %mmm and %ddd.
- It cannot have both %mm and %mmm.
- If it has %dd, it must have %mm or %mmm.
- It must have exactly one of %yy or %yyyy.


When you specify a date format string, prefix each component with the percent symbol (%). Separate the string's components with any character except the percent sign (%). If this format string does not include a day, it is set to the first of the month in the destination field. If the format string does not include the month and day, they default to January 1. Note that the format string must contain a month if it also contains a day; that is, you cannot omit only the month.

The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can also set this using the environment variable APT_DATE_CENTURY_BREAK_YEAR (see "APT_DATE_CENTURY_BREAK_YEAR" in Parallel Job Advanced Developers Guide), but this is overridden by %year_cutoffyy if you have set it.

You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029. (A short sketch of this arithmetic appears at the end of this Date Type section.)

Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
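The year cutoff rule above amounts to: a two-digit year resolves to the first year ending in those two digits that is the same as or later than the cutoff. A minimal C sketch of that arithmetic (not DataStage code) is shown here.

```c
#include <stdio.h>

/* Resolve a two-digit year against a four-digit year cutoff:             */
/* the result is the first year >= cutoff whose last two digits match yy. */
static int resolve_two_digit_year(int yy, int year_cutoff)
{
    int century = (year_cutoff / 100) * 100;
    int year = century + yy;
    if (year < year_cutoff)
        year += 100;
    return year;
}

int main(void)
{
    /* With a cutoff of 1930: 30 -> 1930, 29 -> 2029 (as in the text). */
    printf("%d\n", resolve_two_digit_year(30, 1930));  /* 1930 */
    printf("%d\n", resolve_two_digit_year(29, 1930));  /* 2029 */
    /* With the default cutoff of 1900: 97 -> 1997. */
    printf("%d\n", resolve_two_digit_year(97, 1900));  /* 1997 */
    return 0;
}
```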

Time Type
Byte order. Specifies how multiple byte data types are ordered. Choose from:

little-endian. The high byte is on the right. big-endian. The high byte is on the left. native-endian. As defined by the native format of the machine.

Character Set. Choose from ASCII or EBCDIC. Default. The default value for a column. This is used for data written by a Generate stage. It also supplies the value to substitute for a column that causes an error (whether written or read). Data Format. Specifies the data representation format of a column. Choose from:

binary text


For time, binary is equivalent to midnight_seconds; text specifies that the field represents time in the text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see DataStage NLS Guide).

Format string. Specifies the format of columns representing time as a string. By default this is %hh:%nn:%ss. The possible components of the time format string are:

%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%).

Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.
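The midnight seconds representation is simply hours, minutes, and seconds folded into one count of seconds since 00:00:00. A small illustrative calculation in C:

```c
#include <stdint.h>
#include <stdio.h>

/* Midnight seconds: a time written as a 32-bit count of seconds since 00:00:00. */
static int32_t midnight_seconds(int hh, int nn, int ss)
{
    return (int32_t)(hh * 3600 + nn * 60 + ss);
}

int main(void)
{
    printf("%d\n", midnight_seconds(11, 30, 15));  /* 11:30:15 -> 41415 */
    return 0;
}
```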

Timestamp Type
Byte order. Specifies how multiple byte data types are ordered. Choose from:

little-endian. The high byte is on the right. big-endian. The high byte is on the left. native-endian. As defined by the native format of the machine.

Character Set. Choose from ASCII or EBCDIC. Data Format. Specifies the data representation format of a column. Choose from:

binary text

For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight; that is, a binary timestamp is written as two 32-bit integers. (A sketch of this layout appears at the end of this Timestamp Type section.) Text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see DataStage NLS Guide).

Default. The default value for a column. This is used for data written by a Generate stage. It also supplies the value to substitute for a column that causes an error (whether written or read).

Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss. Specify the format as follows:

For the date:

%dd: A two-digit day. %mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 - 366)

For the time:


%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol (%). Separate the string's components with any character except the percent sign (%).
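To visualize the binary timestamp layout described above (a Julian day count and a seconds-from-midnight count, each a 32-bit integer), here is a hypothetical C structure; the field names are invented for the illustration and the values are arbitrary examples.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of a binary timestamp: two 32-bit integers. */
typedef struct {
    int32_t julian_day;        /* date portion: Julian day count          */
    int32_t midnight_seconds;  /* time portion: seconds since midnight    */
} binary_timestamp;

int main(void)
{
    binary_timestamp ts;
    ts.julian_day = 2453371;                          /* arbitrary example day */
    ts.midnight_seconds = 18 * 3600 + 5 * 60 + 30;    /* 18:05:30              */
    printf("day %d, %d seconds after midnight\n",
           ts.julian_day, ts.midnight_seconds);
    return 0;
}
```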

Integer Type
Byte order. Specifies how multiple byte data types are ordered. Choose from:

little-endian. The high byte is on the right. big-endian. The high byte is on the left. native-endian. As defined by the native format of the machine.

Character Set. Choose from ASCII or EBCDIC.


C_format. Perform non-default conversion of data from a string to integer data. This property specifies a C-language format string used for reading/writing integer strings. This is passed to sscanf() or sprintf(). Default. The default value for a column. This is used for data written by a Generate stage. It also supplies the value to substitute for a column that causes an error (whether written or read). Data Format. Specifies the data representation format of a column. Choose from:

binary text

Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

Field width. The number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it's a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

In_format. Format string used for conversion of data from string to integer. This is passed to sscanf(). By default, DataStage invokes the C sscanf() function to convert a numeric field formatted as a string to either integer or floating point data. If this function does not output data in a satisfactory format, you can specify the in_format property to pass formatting arguments to sscanf().

Is link field. Selected to indicate that a column holds the length of another, variable-length column of the record or of the tag value of a tagged record field.

Out_format. Format string used for conversion of data from integer to a string. This is passed to sprintf(). By default, DataStage invokes the C sprintf() function to convert a numeric field formatted as integer data to a string. If this function does not output data in a satisfactory format, you can specify the out_format property to pass formatting arguments to sprintf(). (A brief illustration of In_format and Out_format follows this list.)


Pad char. Specifies the pad character used when the integer is written to an external string representation. Enter a character (single-byte for strings, can be multi-byte for ustrings) or choose null or space. The pad character is used when the external string representation is larger than required to hold the written field. In this case, the external string is filled with the pad character to its full length. Space is the default.
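Because In_format and Out_format are passed straight to the C library, their effect is easiest to see with sscanf() and sprintf() themselves. The format strings below are only examples of the kind of value you might supply; they are not defaults.

```c
#include <stdio.h>

int main(void)
{
    /* Reading: an in_format such as "%d" converts a string field to an integer. */
    int value = 0;
    sscanf("00042", "%d", &value);           /* value becomes 42 */

    /* Writing: an out_format such as "%06d" pads the integer back out as text.  */
    char text[16];
    sprintf(text, "%06d", value);            /* text becomes "000042" */

    printf("read %d, wrote \"%s\"\n", value, text);
    return 0;
}
```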

Decimal Type
Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No. Character Set. Choose from ASCII or EBCDIC. Decimal separator. Specify the character that acts as the decimal separator (period by default). Default. The default value for a column. This is used for data written by a Generate stage. It also supplies the value to substitute for a column that causes an error (whether written or read). Data Format. Specifies the data representation format of a column. Choose from:

binary text

For decimals, binary means packed. Text represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. The destination string format is [+ | -]ddd.[ddd], and any precision and scale arguments are ignored.

Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

Field width. The number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it's a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.


Packed. Select an option to specify what the decimal columns contain, choose from:

Yes to specify that the decimal columns contain data in packed decimal format (the default). This has the following subproperties: Check. Select Yes to verify that data is packed, or No to not verify. Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the columns actual sign value.

No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the following sub-property: Sign Position. Choose leading or trailing as appropriate.

No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This has the following subproperty: Sign Position. Choose leading or trailing as appropriate.

No (overpunch) to specify that the field has a leading or end byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This has the following subproperty: Sign Position. Choose leading or trailing as appropriate.

Precision. Specifies the precision where a decimal column is represented in text format. Enter a number. When a decimal is written to a string representation, DataStage uses the precision and scale defined for the source decimal field to determine the length of the destination string. The precision and scale properties override this default. When they are defined, DataStage truncates or pads the source decimal to fit the size of the destination string. If you have also specified the field width property, DataStage truncates or pads the source decimal to fit the size specified by field width. Rounding. Specifies how to round the source field to fit into the destination decimal when reading a source field to a decimal. Choose from:

up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE 754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1.


down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE 754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.

nearest value. Round the source column towards the nearest representable value. This mode corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4 becomes -1, -1.5 becomes -2.

truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1. (The sketch at the end of this Decimal Type list illustrates these four modes.)

Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination. By default, when DataStage writes a source decimal to a string representation, it uses the precision and scale defined for the source decimal field to determine the length of the destination string. You can override the default by means of the precision and scale properties. When you do, DataStage truncates or pads the source decimal to fit the size of the destination string. If you have also specified the field width property, DataStage truncates or pads the source decimal to fit the size specified by field width.
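The four rounding modes correspond closely to the standard C functions ceil(), floor(), round(), and trunc(). The sketch below (link with -lm) replays the examples from the list above; it is an analogy, not the stage's own decimal arithmetic.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* up (ceiling): truncate towards positive infinity */
    printf("ceil:  %.0f %.0f\n", ceil(1.4), ceil(-1.6));     /* 2  -1 */

    /* down (floor): truncate towards negative infinity */
    printf("floor: %.0f %.0f\n", floor(1.6), floor(-1.4));   /* 1  -2 */

    /* nearest value: round halves away from zero */
    printf("round: %.0f %.0f\n", round(1.5), round(-1.5));   /* 2  -2 */

    /* truncate towards zero: discard the fractional part */
    printf("trunc: %.0f %.0f\n", trunc(1.6), trunc(-1.6));   /* 1  -1 */
    return 0;
}
```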

Float Type
C_format. Perform non-default conversion of data from a string to floating-point data. This property specifies a C-language format string used for reading floating point strings. This is passed to sscanf(). Character Set. Choose from ASCII or EBCDIC. Default. The default value for a column. This is used for data written by a Generate stage. It also supplies the value to substitute for a column that causes an error (whether written or read). Data Format. Specifies the data representation format of a column. Choose from:

binary text


Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

Field width. The number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it's a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

In_format. Format string used for conversion of data from string to floating point. This is passed to sscanf(). By default, DataStage invokes the C sscanf() function to convert a numeric field formatted as a string to floating point data. If this function does not output data in a satisfactory format, you can specify the in_format property to pass formatting arguments to sscanf().

Is link field. Selected to indicate that a column holds the length of another, variable-length column of the record or of the tag value of a tagged record field.

Out_format. Format string used for conversion of data from floating point to a string. This is passed to sprintf(). By default, DataStage invokes the C sprintf() function to convert a numeric field formatted as floating point data to a string. If this function does not output data in a satisfactory format, you can specify the out_format property to pass formatting arguments to sprintf().

Pad char. Specifies the pad character used when the floating point number is written to an external string representation. Enter a character (single-byte for strings, can be multi-byte for ustrings) or choose null or space. The pad character is used when the external string representation is larger than required to hold the written field. In this case, the external string is filled with the pad character to its full length. Space is the default.

Nullable
This appears for nullable fields. Actual field length. Specifies the number of bytes to fill with the Fill character when a field is identified as null. When DataStage identifies a null field, it will write a field of this length full of Fill characters. This is mutually exclusive with Null field value.


Null field length. The length in bytes of a variable-length field that contains a null. When a variable-length field is read, a length of null field length in the source field indicates that it contains a null. When a variable-length field is written, DataStage writes a length value of null field length if the field contains a null. This property is mutually exclusive with null field value.

Null field value. Specifies the value given to a null field if the source is set to null. Can be a number, string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values. This property is mutually exclusive with Null field length and Actual length. For a fixed-width data representation, you can use Pad char (from the general section of Type defaults) to specify a repeated trailing character if the value you specify is shorter than the fixed width of the field. On reading, specifies the value given to a field containing a null. On writing, specifies the value given to a field if the source is set to null.

Generator
If the column is being used in a Row Generator or Column Generator stage, this allows you to specify extra details about the mock data being generated. The exact fields that appear depend on the data type of the column being generated. They allow you to specify features of the data being generated; for example, for integers they allow you to specify if values are random or whether they cycle. If they cycle you can specify an initial value, an increment, and a limit. If they are random, you can specify a seed value for the random number generator, whether to include negative numbers, and a limit.

The Generate options available for each data type (cycle or random operation, plus type-specific settings such as increment, initial value, limit, seed, signed, string value, algorithm, alphabet, epoch, scale factor, percent invalid, and percent zero) are described below.

All data types

All data types other than string have two types of operation, cycle and random:

Cycle. The cycle option generates a repeating pattern of values for a column. It has the following optional dependent properties:


Increment. The increment value added to produce the field value in the next output record. The default value is 1 (integer) or 1.0 (float).

Initial value. The initial field value (the value of the first output record). The default value is 0.

Limit. The maximum field value. When the generated field value is greater than Limit, it wraps back to Initial value. The default value of Limit is the maximum allowable value for the field's data type.

You can set these to part to use the partition number (e.g., 0, 1, 2, 3 on a four node system), or partcount to use the total number of executing partitions (e.g., 4 on a four node system). Random. The random option generates random values for a field. It has the following optional dependent properties:

Limit. Maximum generated field value. The default value of limit is the maximum allowable value for the field's data type.

Seed. The seed value for the random number generator used by the stage for the field. You do not have to specify seed. By default, the stage uses the same seed value for all fields containing the random option.

Signed. Specifies that signed values are generated for the field (values between -limit and +limit). Otherwise, the operator creates values between 0 and +limit.

You can set limit and seed to part to use the partition number (e.g., 0, 1, 2, 3 on a four-node system), or to partcount to use the total number of executing partitions (e.g., 4 on a four-node system).

Strings

By default the generator stages initialize all bytes of a string field to the same alphanumeric character. The stages use the following characters, in the following order:
abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ

For example, a string with a length of 5 would produce successive string fields with the values:
aaaaa bbbbb ccccc ddddd ...

After the last character, capital Z, values wrap back to lowercase a and the cycle repeats.


You can also use the algorithm property to determine how string values are generated. This has two possible values, cycle and alphabet:

Cycle. Values are assigned to a generated string field as a set of discrete string values to cycle through. This has the following dependent property:

Values. Repeat this property to specify the string values that the generated data cycles through.
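For example (hypothetical values), repeating the Values property with the strings North, South, East, and West generates the sequence North, South, East, West, North, ... in successive rows.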

Alphabet. Values are assigned to a generated string field as a character string, each of whose characters is taken in turn. This is like the default mode of operation except that you can specify the string cycled through using the dependent property String.

Decimal

As well as the Type property, decimal columns have the following properties:

Percent invalid. The percentage of generated columns that will contain invalid values. Set to 10% by default.

Percent zero. The percentage of generated decimal columns where all bytes of the decimal are set to binary zero (0x00). Set to 10% by default.

Date

As well as the Type property, date columns have the following properties:

Epoch. Use this to specify the earliest generated date value, in the format yyyy-mm-dd (leading zeros must be supplied for all parts). The default is 1960-01-01.

Percent invalid. The percentage of generated columns that will contain invalid values. Set to 10% by default.

Use current date. Set this to generate today's date in this column for every row generated. If you set this, all other properties are ignored.

Time

As well as the Type property, time columns have the following properties:

Percent invalid. The percentage of generated columns that will contain invalid values. Set to 10% by default.


Scale factor. Specifies a multiplier to the increment value for time. For example, a scale factor of 60 and an increment of 1 means the field increments by 60 seconds.

Timestamp

As well as the Type property, timestamp columns have the following properties:

Epoch. Use this to specify the earliest generated date value, in the format yyyy-mm-dd (leading zeros must be supplied for all parts). The default is 1960-01-01.

Use current date. Set this to generate today's date in this column for every row generated. If you set this, all other properties are ignored.

Percent invalid. The percentage of generated columns that will contain invalid values. Set to 10% by default.

Scale factor. Specifies a multiplier to the increment value for time. For example, a scale factor of 60 and an increment of 1 means the field increments by 60 seconds.
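As an illustration (assuming a cycling time column with an initial value of 00:00:00, an increment of 1, and a scale factor of 60), successive rows would contain 00:00:00, 00:01:00, 00:02:00, and so on.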

Vectors
If the row you are editing represents a column which is a variable length vector, tick the Variable check box. The Vector properties then appear; these give the size of the vector in one of two ways:

Link Field Reference. The name of a column containing the number of elements in the variable length vector. This should have an integer or float type, and have its Is Link field property set.

Vector prefix. Specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.

If the row you are editing represents a column which is a vector of known length, enter the number of elements in the Vector Occurs box.

Subrecords
If the row you are editing represents a column which is part of a subrecord, the Level Number column indicates the level of the column within the subrecord structure. If you specify Level numbers for columns, the column immediately preceding will be identified as a subrecord. Subrecords can be nested, so can contain further subrecords with higher level numbers (i.e., level 06 is nested within level 05). Subrecord fields have a Tagged check box to indicate that this is a tagged subrecord.
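As an illustration (the column names here are hypothetical), a nested subrecord might be laid out with level numbers like this:

   Level  Column
   05     Address       (subrecord)
   06       Street
   06       City
   06       PostalCode

Here the three level-06 columns are nested within the level-05 Address subrecord.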


Extended
For certain data types the Extended check box appears to allow you to modify the data type as follows:

Char, VarChar, LongVarChar. Select to specify that the underlying data type is a ustring.

Time. Select to indicate that the time field includes microseconds.

Timestamp. Select to indicate that the timestamp field includes microseconds.

TinyInt, SmallInt, Integer, BigInt types. Select to indicate that the underlying data type is the equivalent uint field.

Advanced Tab
The Advanced tab allows you to specify how DataStage buffers data being input to this stage. By default DataStage buffers data in such a way that no deadlocks can arise; a deadlock being the situation where a number of stages are mutually dependent, each waiting for input from another stage and unable to output until it has been received. The size and operation of the buffer are usually the same for all links on all stages (the default values that the settings take can be set using environment variables; see "Configuring for Enterprise Edition" in the Install and Upgrade Guide). The Advanced tab allows you to specify buffer settings on a per-link basis. You should only change the settings if you fully understand the consequences of your actions (otherwise you might cause deadlock situations to arise).


Any changes you make on this tab will automatically be reflected in the Outputs Page Advanced Tab of the stage at the other end of this link.

The settings are as follows: Buffering mode. Select one of the following from the drop-down list.

(Default). This will take whatever the default settings are as specified by the environment variables (this will be Auto buffer unless you have explicitly changed the value of the APT_BUFFERING_POLICY environment variable).

Auto buffer. Buffer output data only if necessary to prevent a dataflow deadlock situation.

Buffer. This will unconditionally buffer all data output from this stage.

No buffer. Do not buffer output data under any circumstances. This could potentially lead to deadlock situations if not used carefully.

If you choose the Auto buffer or Buffer options, you can also set the values of the various buffering parameters: Maximum memory buffer size (bytes). Specifies the maximum amount of virtual memory, in bytes, used per buffer. The default size is 3145728 (3 MB). Buffer free run (percent). Specifies how much of the available in-memory buffer to consume before the buffer resists. This is expressed as a percentage of Maximum memory buffer size. When the amount of data in the buffer is less than this value, new
data is accepted automatically. When the data exceeds it, the buffer first tries to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing to disk. Queue upper bound size (bytes). Specifies the maximum amount of data buffered at any time using both memory and disk. The default value is zero, meaning that the buffer size is limited only by the available disk space as specified in the configuration file (resource scratchdisk). If you set Queue upper bound size (bytes) to a non-zero value, the amount of data stored in the buffer will not exceed this value (in bytes) plus one block (where the data stored in a block cannot exceed 32 KB). If you set Queue upper bound size to a value equal to or slightly less than Maximum memory buffer size, and set Buffer free run to 1.0, you will create a finite capacity buffer that will not write to disk. However, the size of the buffer is limited by the virtual memory of your system and you can create deadlock if the buffer becomes full. Disk write increment (bytes). Sets the size, in bytes, of blocks of data being moved to/from disk by the buffering operator. The default is 1048576 (1 MB). Adjusting this value trades amount of disk access against throughput for small amounts of data. Increasing the block size reduces disk access, but may decrease performance when data is being read/written in smaller units. Decreasing the block size increases throughput, but may increase the amount of disk access.
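For example (using the default figures quoted above), with Maximum memory buffer size at 3145728 bytes and Buffer free run at 50%, the buffer accepts roughly 1.5 MB of data freely; beyond that point it tries to write some of its contents to disk before accepting more.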

Outputs Page
The Outputs page gives information about links going out of a stage. In the case of a file or database stage an output link carries data being read from the file or database. In the case of a processing or restructure stage it carries data that the stage has processed. Where there are no output links the stage editor has no Outputs page. Where it is present, the Outputs page contains various tabs depending on stage type. The only field the Outputs page itself contains is Output name, which gives the name of the link being edited. Where a stage has more than one output link, you can select the link you are editing from the Output name drop-down list.


The Outputs page also has a Columns button. Click Columns to open a window showing column names from the meta data defined for this link. You can drag these columns to various fields in the Outputs page tabs as required. Certain stage types will also have a View Data button. Press this to view the actual data associated with the specified data source or data target. The button is available if you have defined meta data for the link. The Sequential File stage has a Show File button, rather than View Data. This shows the flat file as it has been created on disk.

General Tab
The Outputs page always has a General tab. This allows you to enter an optional description of the link. Specifying a description for each link enhances job maintainability.

Properties Tab
Some types of file and database stages can have properties that are particular to specific output links. In this case the Outputs page has a Properties tab. This has the same format as the Stage page Properties tab (see "Properties Tab" on page 3-8).

Format Tab
Stages that read from certain types of file (e.g., the Sequential File stage) also have a Format tab which allows you to specify the format of the file or files being read from.

The Format page is similar in structure to the Properties page. A flat file has a number of properties for which you can set different attributes. Select the property in the tree and select the attributes you want to set from the Available properties to add window; it will then appear as
a dependent property in the property tree and you can set its value as required. This tab sets the format information for the file at row level. You can override the settings for individual columns using the Edit Column Metadata dialog box (see page 3-28). Format details are also stored with table definitions, and you can use the Load button to load a format from a table definition stored in the DataStage Repository. The short-cut menu from the property tree gives access to the following functions: Format as. This applies a predefined template of properties. Choose from the following:

Delimited/quoted
Fixed-width records
UNIX line terminator
DOS line terminator
No terminator (fixed width)
Mainframe (COBOL)

Add sub-property. Gives access to a list of dependent properties for the currently selected property (visible only if the property has dependents).

Set to default. Appears if the currently selected property has been set to a non-default value, allowing you to re-select the default.

Remove. Removes the currently selected property. This is disabled if the current property is mandatory.

Remove all. Removes all the non-mandatory properties.

Details of the properties you can set are given in the chapter describing the individual stage editors:

Sequential File stage: page 5-30
File Set stage: page 6-25
External Source stage: page 8-7
Column Import stage: page 41-12


Columns Tab
The Outputs page always has a Columns tab. This displays the column meta data for the selected output link in a grid.

There are various ways of populating the grid:

If the other end of the link has meta data specified for it, this will be displayed in the Columns tab (meta data is associated with, and travels with, a link).

You can type the required meta data into the grid. When you have done this, you can click the Save button to save the meta data as a table definition in the Repository for subsequent reuse.

You can load an existing table definition from the Repository. Click the Load button to be offered a choice of table definitions to load.

If the stage you are editing is a general or restructure stage with a Mapping tab, you can drag data from the left pane to the right pane. This automatically populates the right pane and the Columns tab.

If runtime column propagation is enabled in the DataStage Administrator, you can select Runtime column propagation to specify that columns encountered by the stage can be used even if they are not explicitly defined in the meta data. There are some special considerations when using runtime column propagation with certain stage types:

Sequential File
File Set
External Source
External Target

See the individual stage descriptions for details of these. If the selected output link is a reject link, the column meta data grid is read only and cannot be modified. If you select the options in the Grid Properties dialog box (see "Grid Properties" in DataStage Designer Guide), the Columns tab will also display two extra fields: Table Definition Reference and Column Definition Reference. These show the table definition and individual columns that the columns on the tab were derived from. If you click in a row and select Edit Row from the shortcut menu, the Edit Column meta data dialog box appears, which allows you to edit the row details in a dialog box format. It also has a Parallel tab which allows you to specify properties that are peculiar to parallel job column definitions. The properties you can specify here are the same as those specified for input links (see page 3-27).

Mapping Tab
For processing and restructure stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns and/or the generated columns. These are read only and cannot be modified on this tab. These columns represent the data that the stage has produced after it has processed the input data. The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. If you have not yet defined any output column definitions, dragging columns over will define them for you. If you have already defined output column definitions, DataStage performs the mapping for you as far as possible: you can do this explicitly using the auto-match facility, or implicitly by just visiting the Mapping tab and clicking OK (which is the equivalent of auto-matching on name). There is also a shortcut menu which gives access to a range of column selection and editing functions, including the facilities for selecting multiple columns and editing multiple derivations (this functionality is described in the Transformer chapter, page 16-9 and page 16-13). You may choose not to map all the left hand columns, for example if your output data is a subset of your input data, but be aware that, if you have Runtime Column Propagation turned on for that link, the data you have not mapped will appear on the output link anyway. You can also perform mapping without actually opening the stage editor. Select the stage in the Designer canvas and choose Auto-map from the shortcut menu. In the above example the left pane represents the data after it has been joined. The Expression field shows how the column has been derived, and the Column Name shows the column after it has been joined. The right pane represents the data being output by the stage after the join. In this example the data has been mapped straight across. More details about mapping operations for the different stages are given in the individual stage descriptions:
Stage                 Chapter
Aggregator            Chapter 18
Join                  Chapter 19
Funnel                Chapter 22
Lookup                Chapter 21
Sort                  Chapter 23
Merge                 Chapter 20
Remove Duplicates     Chapter 24
Sample                Chapter 51
Change Capture        Chapter 31
Change Apply          Chapter 32
Difference            Chapter 33
Column Import         Chapter 41
Column Export         Chapter 42
Head                  Chapter 49
Tail                  Chapter 50
Peek                  Chapter 52
Column Generator      Chapter 54
Copy                  Chapter 27
SAS                   Chapter 38

A shortcut menu can be invoked from the right pane that allows you to:

Find and replace column names.
Validate a derivation you have entered.
Clear an existing derivation.
Append a new column.
Select all columns.
Insert a new column at the current position.
Delete the selected column or columns.
Cut and copy columns.
Paste a whole column.
Paste just the derivation from a column.

The Find button opens a dialog box which allows you to search for particular output columns.


The Auto-Match button opens a dialog box which will automatically map left pane columns onto right pane columns according to the specified criteria.

Select Location match to map input columns onto the output ones occupying the equivalent position. Select Name match to match by names. You can specify that all columns are to be mapped by name, or only the ones you have selected. You can also specify that prefixes and suffixes are ignored for input and output columns, and that case can be ignored.

Advanced Tab
The Advanced tab allows you to specify how DataStage buffers data being output from this stage. By default DataStage buffers data in such a way that no deadlocks can arise; a deadlock being the situation where a number of stages are mutually dependent, each waiting for input from another stage and unable to output until it has been received. The size and operation of the buffer are usually the same for all links on all stages (the default values that the settings take can be set using environment variables; see "Configuring for Enterprise Edition" in the Install and Upgrade Guide). The Advanced tab allows you to specify buffer settings on a per-link basis. You should only change the settings if you fully understand the consequences of your actions (otherwise you might cause deadlock situations to arise).


Any changes you make on this tab will automatically be reflected in the Inputs Page Advanced Tab of the stage at the other end of this link.

The settings are as follows: Buffering mode. Select one of the following from the drop-down list.

(Default). This will take whatever the default settings are as specified by the environment variables (this will be Auto buffer unless you have explicitly changed the value of the APT_BUFFERING_POLICY environment variable).

Auto buffer. Buffer output data only if necessary to prevent a dataflow deadlock situation.

Buffer. This will unconditionally buffer all data output from this stage.

No buffer. Do not buffer output data under any circumstances. This could potentially lead to deadlock situations if not used carefully.

If you choose the Auto buffer or Buffer options, you can also set the values of the various buffering parameters: Maximum memory buffer size (bytes). Specifies the maximum amount of virtual memory, in bytes, used per buffer. The default size is 3145728 (3 MB). Buffer free run (percent). Specifies how much of the available in-memory buffer to consume before the buffer resists. This is expressed as a percentage of Maximum memory buffer size. When the amount of data in the buffer is less than this value, new
data is accepted automatically. When the data exceeds it, the buffer first tries to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing to disk. Queue upper bound size (bytes). Specifies the maximum amount of data buffered at any time using both memory and disk. The default value is zero, meaning that the buffer size is limited only by the available disk space as specified in the configuration file (resource scratchdisk). If you set Queue upper bound size (bytes) to a non-zero value, the amount of data stored in the buffer will not exceed this value (in bytes) plus one block (where the data stored in a block cannot exceed 32 KB). If you set Queue upper bound size to a value equal to or slightly less than Maximum memory buffer size, and set Buffer free run to 1.0, you will create a finite capacity buffer that will not write to disk. However, the size of the buffer is limited by the virtual memory of your system and you can create deadlock if the buffer becomes full. Disk write increment (bytes). Sets the size, in bytes, of blocks of data being moved to/from disk by the buffering operator. The default is 1048576 (1 MB). Adjusting this value trades amount of disk access against throughput for small amounts of data. Increasing the block size reduces disk access, but may decrease performance when data is being read/written in smaller units. Decreasing the block size increases throughput, but may increase the amount of disk access.


4
Data Set Stage
The Data Set stage is a file stage. It allows you to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode.

What is a data set? DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the DataStage Designer, Manager, or Director; see Chapter 57.

The stage editor has up to three pages, depending on whether you are reading or writing a data set:


Stage Page. This is always present and is used to specify general information about the stage.

Inputs Page. This is present when you are writing to a data set. This is where you specify details about the data set being written to.

Outputs Page. This is present when you are reading from a data set. This is where you specify details about the data set being read from.

Must Dos
DataStage has many defaults, which means that it can be very easy to include Data Set stages in a job. This section specifies the minimum steps to take to get a Data Set stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic methods, and you will learn where the shortcuts are when you get familiar with the product. The steps required depend on whether you are using the Data Set stage to read or write a data set.

Writing to a Data Set


In the Input Link Properties Tab specify the pathname of the control file for the target data set. Set the Update Policy property, or accept the default setting of Overwrite.

Ensure column meta data has been specified for the data set (this may have already been done in a preceding stage).

Reading from a Data Set


In the Output Link Properties Tab specify the pathname of the control file for the source data set.

Ensure column meta data has been specified for the data set.

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes.


Advanced Tab
This tab allows you to specify the following:

Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the contents of the data set are processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire contents of the data set are processed by the conductor node.

Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.

Preserve partitioning. You can select Propagate, Set or Clear. If you select Set, file read operations will request that the next stage preserves the partitioning as is. Propagate takes the setting of the flag from the previous stage.

Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about how the Data Set stage writes data to a data set. The Data Set stage can have only one input link. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Data Set stage properties are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what data set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows:
Category/Property      Values                                  Default     Mandatory?   Repeats?   Dependent of
Target/File            pathname                                N/A         Y            N          N/A
Target/Update Policy   Append/Create (Error if exists)/        Overwrite   Y            N          N/A
                       Overwrite/Use existing (Discard
                       records)/Use existing (Discard
                       records and schema)
Target Category
File

The name of the control file for the data set. You can browse for the file or enter a job parameter. By convention, the file has the suffix .ds.

Update Policy

Specifies what action will be taken if the data set you are writing to already exists. Choose from:

Append. Append any new data to the existing data.

Create (Error if exists). DataStage reports an error if the data set already exists.

Overwrite. Overwrites any existing data with new data.

Use existing (Discard records). Keeps the existing data and discards any new data.

Use existing (Discard records and schema). Keeps the existing data and discards any new data and its associated schema.
The default is Overwrite.
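For example (an illustrative path and parameter name only), the File property might be set to #OutputDir#/daily_orders.ds, where #OutputDir# is a job parameter that is resolved at run time.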

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the data set. It also allows you to specify that the data should be sorted before being written. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Data Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

Whether the Data Set stage is set to execute in parallel or sequential mode.

Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Data Set stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Data Set stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following partitioning methods are available:

(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Data Set stage.

Entire. Each file written to receives the entire data set.

Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields (see the illustration at the end of this section).

Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.

Same. Preserves the partitioning already in place.

DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

(Auto). This is the default collection method for the Data Set stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.

Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the data set. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with Auto methods). Select the check boxes as follows:

Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

Stable. Select this if you want to preserve previously sorted data sets. This is the default.

Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.


If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
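To illustrate the Modulus method mentioned above (the key values are invented for the example), a record's partition number is its key value modulo the number of partitions: running four partitions, key values 0 through 7 would be routed to partitions 0, 1, 2, 3, 0, 1, 2, 3 respectively.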

Outputs Page
The Outputs page allows you to specify details about how the Data Set stage reads data from a data set. The Data Set stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the output link. Details about Data Set stage properties and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Output Link Properties Tab


The Properties tab allows you to specify properties for the output link. These dictate how data is read from the data set. A Data Set stage only has one property, but this is mandatory.
Category/Property   Values     Default   Mandatory?   Repeats?   Dependent of
Source/File         pathname   N/A       Y            N          N/A

Source Category
File

The name of the control file for the data set. You can browse for the file or enter a job parameter. By convention the file has the suffix .ds.


5
Sequential File Stage
The Sequential File stage is a file stage. It allows you to read data from or write data to one or more flat files. The stage can have a single input link or a single output link, and a single rejects link.

When you edit a Sequential File stage, the Sequential File stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors."


The stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one file. By default a complete file will be read by a single node (although each node might read more than one file). For fixed-width files, however, you can configure the stage to behave differently:

You can specify that single files can be read by multiple nodes. This can improve performance on cluster systems. See "Read From Multiple Nodes" on page 5-29.

You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node). See "Number Of Readers Per Node" on page 5-28.

(These two options are mutually exclusive.)

The stage executes in parallel if writing to multiple files, but executes sequentially if writing to a single file. Each node writes to a single file, but a node can write more than one file. When reading or writing a flat file, DataStage needs to know something about the format of the file. The information required is how the file is divided into rows and how rows are divided into columns. You specify this on the Format tab. Settings for individual columns can be overridden on the Columns tab using the Edit Column Metadata dialog box. The stage editor has up to three pages, depending on whether you are reading or writing a file:

Stage Page. This is always present and is used to specify general information about the stage.

Inputs Page. This is present when you are writing to a flat file. This is where you specify details about the file or files being written to.

Outputs Page. This is present when you are reading from a flat file and/or have a reject link. This is where you specify details about the file or files being read from.

There are one or two special points to note about using runtime column propagation (RCP) with Sequential stages. See "Using RCP With Sequential Stages" on page 5-42 for details.


Example of Writing a Sequential File


In the following example, the Sequential File stage is set up to write a comma-delimited file. Here is a sample of the data as it will be written:
2,Scottish Longbreads,10.00,15,25/04/2001,Should Eat Warm,Q2 6,Maxilaku,16.00,30,02/01/2002,,Q2 10,Perth Pasties,26.20,10,12/08/2001,Warm Before Heating,Q2 14,Outback Lager,12.00,5,02/01/2002,Do Not Shake,Q2 18,Singaporean Hokkien Fried Mee,11.20,2,02/01/2002,,Q2 22,Gudbrandsdalsost,28.80,7,02/01/2002,,Q2 26,Escargots de Bourgogne,10.60,30,02/01/2002,,Q2 30,Outback Lager,12.00,30,02/01/2002,Do Not Shake,Q2 34,Flotemysost,17.20,30,02/01/2002,,Q2 38,Chartreuse verte,14.40,4,02/01/2002,,Q2 42,Spegesild,9.60,30,02/01/2002,,Q2 46,Konbu,4.80,12,02/01/2002,,Q2 50,Nord-Ost Matjeshering,20.70,35,02/01/2002,,Q2 54,Raclette Courdavault,44.00,9,22/12/2001,,Q2 58,Gnocchi di nonna Alice,30.40,12,02/01/2002,,Q2 62,Zaanse koeken,7.60,16,02/01/2002,,Q2 66,Filo Mix,5.60,8,02/01/2002,Please Hurry,Q2 70,Mascarpone Fabioli,25.60,6,02/01/2002,,Q2

The meta data for the file is defined in the Columns tab as follows:

The Format tab is set as follows to define that the stage will write a file where each column is delimited by a comma, there is no final delimiter, and any dates in the data are expected to have the format dd/mm/yyyy, rather than yyyy-mm-dd, which is the default format:

Example of Reading a Sequential File


In the following example, the sequential file stage is set up to read a fixed width file. Here is a sample of the data in the file:
0136.801205/04/2001 0210.001525/04/2001 0316.803002/01/2002 0414.704002/01/2002 0517.200202/01/2002 0616.003002/01/2002 0744.001012/08/2001 0814.403002/01/2002 0950.002502/01/2002 1026.201012/08/2001 1120.701012/08/2001 1239.401012/08/2001 1310.000302/01/2002 1412.000502/01/2002 1528.800102/01/2002 1636.802021/06/2001
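(From the sample it appears that each 19-character row breaks down as a 2-character identifier, a 5-character price, a 2-character quantity, and a 10-character date; this is an inference from the data shown, and the definitive layout is given by the column definitions in the Columns tab.)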


The meta data for the file is defined in the Columns tab as follows:

The Format tab is set as follows to define that the stage is reading a fixed width file where each row is delimited by a UNIX newline, and the columns have no delimiter:


Must Dos
DataStage has many defaults, which means that it can be very easy to include Sequential File stages in a job. This section specifies the minimum steps to take to get a Sequential File stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are when you get familiar with the product. The steps required depend on whether you are using the Sequential File stage to read or write a file.

Writing to a File
In the Input Link Properties Tab specify the pathname of the file being written to (repeat this for writing to multiple files). The other properties all have default values, which you can change or not as required.

In the Input Link Format Tab specify format details for the file(s) you are writing to, or accept the defaults (variable length columns enclosed in double quotes and delimited by commas, rows delimited with UNIX newlines).

Ensure column meta data has been specified for the file(s) (this can be achieved via a schema file if required).

Reading from a File


In the Output Link Properties Tab:

In Read Method, specify whether to read specific files (the default) or all files whose name fits a pattern.

If you are reading specific files, specify the pathname of the file being read from (repeat this for reading multiple files).

If you are reading files that fit a pattern, specify the name pattern to match.

Accept the default for the options or specify new settings (available options depend on the Read Method).

In the Output Link Format Tab specify format details for the file(s) you are reading from, or accept the defaults (variable length columns enclosed in double quotes and delimited by commas, rows delimited with UNIX newlines). Ensure column meta data has been specified for the file(s) (this can be achieved via a schema file if required).


Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage.

Advanced Tab
This tab allows you to specify the following:

Execution Mode. The stage can execute in parallel mode or sequential mode. When a stage is reading or writing a single file the Execution Mode is sequential and you cannot change it. When a stage is reading or writing multiple files, the Execution Mode is parallel and you cannot change it. In parallel mode, the files are processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire contents of the file are processed by the conductor node.

Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.

Preserve partitioning. You can select Set or Clear. If you select Set, file read operations will request that the next stage preserves the partitioning as is (it is ignored for file write operations). If you set the Keep File Partitions output property this will automatically set the preserve partitioning flag.

Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

NLS Map Tab


The NLS Map tab allows you to define a character set map for the Sequential File stage. This overrides the default character set map set
for the project or the job. You can specify that the map be supplied as a job parameter if required. You can also select Allow per-column mapping. This allows character set maps to be specified for individual columns within the data processed by the Sequential File stage. An extra property, NLS Map, appears in the Columns grid in the Columns tab, but note that only ustring data types allow you to set an NLS map value (see "Data Types" on page 2-28).

Inputs Page
The Inputs page allows you to specify details about how the Sequential File stage writes data to one or more flat files. The Sequential File stage can have only one input link, but this can write to multiple files. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the file or files. The Formats tab gives information about the format of the files being written. The Columns tab specifies the column definitions of data being written. The Advanced tab allows you to change the default buffering settings for the input link. Details about Sequential File stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what files. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property             Values                     Default     Mandatory?   Repeats?   Dependent of
Target/File                   Pathname                   N/A         Y            Y          N/A
Target/File Update Mode       Append/Create/Overwrite    Overwrite   Y            N          N/A
Options/Cleanup On Failure    True/False                 True        Y            N          N/A
Options/Reject Mode           Continue/Fail/Save         Continue    Y            N          N/A
Options/Filter                Command                    N/A         N            N          N/A
Options/Schema File           Pathname                   N/A         N            N          N/A
Target Category
File

This property defines the flat file that the incoming data will be written to. You can type in a pathname, or browse for a file. You can specify multiple files by repeating the File property. Do this by selecting the Properties item at the top of the tree, and clicking on File in the Available properties to add window. Do this for each extra file you want to specify. You must specify at least one file to be written to, which must exist unless you specify a File Update Mode of Create or Overwrite.

File Update Mode

This property defines how the specified file or files are updated. The same method applies to all files being written to. Choose from Append to append to existing files, Overwrite to overwrite existing files, or Create to create a new file. If you specify the Create property for a file that already exists you will get an error at runtime.


By default this property is set to Overwrite.

Options Category
Cleanup On Failure

This is set to True by default and specifies that the stage will delete any partially written files if the stage fails for any reason. Set this to False to specify that partially written files should be left.

Reject Mode

This specifies what happens to any data records that are not written to a file for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease writing if any rows are rejected, or Save to send rejected rows down a reject link. Continue is set by default.

Filter

This is an optional property. You can use this to specify that the data is passed through a filter program before being written to the file or files. Specify the filter command, and any required arguments, in the Property Value box.

Schema File

This is an optional property. By default the Sequential File stage will use the column definitions defined on the Columns and Format tabs as a schema for writing to the file. You can, however, specify a file containing a schema instead (note, however, that if you have defined columns on the Columns tab, you should ensure these match the schema file). Type in a pathname or browse for a schema file.
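As an illustration only (the column names and types are hypothetical, and you should check the schema appendix for the exact syntax your release supports), a schema file for a simple comma-delimited file might look something like this:

   record
     {final_delim=end, delim=',', quote=none}
   (
     OrderID: int32;
     ProductName: string[max=30];
     Price: decimal[5,2];
     ShipDate: date;
   )

Similarly, the Filter property simply names a program through which the rows are piped; a compression utility such as gzip is one possible choice (again, an illustrative example rather than a requirement).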

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the file or files. It also allows you to specify that the data should be sorted before being written. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file.


If the Sequential File stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

Whether the Sequential File stage is set to execute in parallel or sequential mode.

Whether the preceding stage in the job is set to execute in parallel or sequential mode.


Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

(Auto). This is the default collection method for the Sequential File stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.

Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the file or files. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the Auto methods). Select the check boxes as follows:

Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

Stable. Select this if you want to preserve previously sorted data sets. This is the default.

Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.


Input Link Format Tab


The Format tab allows you to supply information about the format of the flat file or files to which you are writing. The tab has a similar format to the Properties tab and is described on page 3-44. If you do not alter any of the Format settings, the Sequential File stage will produce a file of the following format:

File comprises variable length columns contained within double quotes.
All columns are delimited by a comma, except for the final column in a row.
Rows are delimited by a UNIX newline.

You can use the Format As item from the shortcut menu in the Format Tab to quickly change to a fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file. You can use the Defaults button to change your default settings. Use the Format tab to specify your required settings, then click Defaults > Save current as default. All your sequential files will use your settings by default from now on. If your requirements change, you can choose Defaults > Reset defaults from factory settings to go back to the original defaults as described above. Once you have done this, you then have to click Defaults > Set current from default for the new defaults to take effect. To change individual properties, select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Popup help for each of the available properties appears if you hover the mouse pointer over it. Any property that you set on this tab can be overridden at the column level by setting properties for individual columns on the Edit Column Metadata dialog box (see page 3-26). This description uses the terms record and row and field and column interchangeably. The following sections list the property types and properties available for each type.

Record level

These properties define details about how data records are formatted in the flat file. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are:

Fill char. Specify an ASCII character or a value in the range 0 to 255. You can also choose Space or Null from a drop-down list. This character is used to fill any gaps in a written record caused by column positioning properties. Set to 0 by default (which is the NULL character). For example, to set it to space you could also type in the space character or enter 32. Note that this value is restricted to one byte, so you cannot specify a multi-byte Unicode character.

Final delimiter string. Specify a string to be written after the last column of a record in place of the column delimiter. Enter one or more characters; this precedes the record delimiter if one is used. Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to comma (see under "Field Defaults" for Delimiter) and Final delimiter string to ", " (comma space - you do not need to enter the inverted commas), all fields are delimited by a comma, except the final field, which is delimited by a comma followed by an ASCII space character.

Final delimiter. Specify a single character to be written after the last column of a record in place of the field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. See the following diagram for an illustration.

whitespace. The last column of each record will not include any trailing white spaces found at the end of the record.
end. The last column of each record does not include the field delimiter. This is the default setting.
none. The last column of each record does not have a delimiter; used for fixed-width fields.
null. The last column of each record is delimited by the ASCII null character.
comma. The last column of each record is delimited by the ASCII comma character.
tab. The last column of each record is delimited by the ASCII tab character.
[Diagram: example records showing the record delimiter and field delimiters, comparing Final Delimiter = end with Final Delimiter = comma]
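To illustrate the two settings with a comma field delimiter (the field values shown are placeholders), the written records look like this, where <nl> stands for the record delimiter:

    Final delimiter = end:    Field 1,Field 1,Field 1,Last field<nl>
    Final delimiter = comma:  Field 1,Field 1,Field 1,Last field,<nl>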

When writing, a space is now inserted after every field except the last in the record. Previously, a space was inserted after every field including the last. (If you want to revert to the pre-release 7.5 behavior of inserting a space after the last field, set the APT_FINAL_DELIM_COMPATIBLE environment variable.)

Intact. The intact property specifies an identifier of a partial schema. A partial schema specifies that only the column(s) named in the schema can be modified by the stage. All other columns in the row are passed through unmodified. (See "Partial Schemas" in Appendix A for details.) The file containing the partial schema is specified in the Schema File property on the Properties tab (see page 5-9). This property has a dependent property, Check intact, but this is not relevant to input links.

Record delimiter string. Specify a string to be written at the end of each record. Enter one or more characters. This is mutually exclusive with Record delimiter, which is the default, and with Record type and Record prefix.

Record delimiter. Specify a single character to be written at the end of each record. Type a character or select one of the following:

UNIX Newline (the default)
null

(To implement a DOS newline, use the Record delimiter string property set to \R\N or choose Format as DOS line terminator from the shortcut menu.)

Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record type.

Record length. Select Fixed where fixed length fields are being written. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes. This is not used by default (by default, files are comma-delimited). The record is padded to the specified length with either zeros or the fill character if one has been specified.

Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by default. This is mutually exclusive with Record delimiter, which is the default, and with Record delimiter string and Record type.

Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or VR.


This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix, and by default is not used.

Field Defaults

Defines default properties for columns written to the file or files. These are applied to all columns written, but can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are:

Actual field length. Specifies the number of bytes to fill with the Fill character when a field is identified as null. When DataStage identifies a null field, it will write a field of this length full of Fill characters. This is mutually exclusive with Null field value.

Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select one of whitespace, end, none, null, comma, or tab.

whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of the column.
end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the same as a setting of None which is used for fields with fixed-width columns.
none. No delimiter (used for fixed-width).
null. ASCII Null character is used.
comma. ASCII comma character is used.
tab. ASCII tab character is used.

Delimiter string. Specify a string to be written at the end of each field. Enter one or more characters. This is mutually exclusive with Delimiter, which is the default. For example, specifying ", " (comma space - you do not need to enter the inverted commas) would have each field delimited by ", " unless overridden for individual fields.

Null field length. The length in bytes of a variable-length field that contains a null. When a variable-length field is written, DataStage writes a length value of null field length if the field contains a null. This property is mutually exclusive with Null field value.

Null field value. Specifies the value written to null field if the source is set to null. Can be a number, string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values. This property is mutually exclusive with Null field length and Actual length. For a fixed width data representation, you can use Pad char (from the general section of Type defaults) to specify a repeated trailing character if the value you specify is shorter than the fixed width of the field.

Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column's length or the tag value for a tagged field. You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. DataStage inserts the prefix before each field. This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which are used by default.

Print field. This property is not relevant for input links.

Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another character or pair of characters. Choose Single or Double, or enter a character. This is set to double quotes by default. When writing, DataStage inserts the leading quote character, the data, and a trailing quote character. Quote characters are not counted as part of a field's length.

Vector prefix. For fields that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector. You can override this default prefix for individual vectors. Variable-length vectors must use either a prefix on the vector or a link to another field in order to specify the number of elements in the vector. If the variable length vector has a prefix, you use this property to indicate the prefix length. DataStage inserts the element count as a prefix of each variable-length vector field. By default, the prefix length is assumed to be one byte.

Type Defaults

These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.


General

These properties apply to several data types (unless overridden at column level):

Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

little-endian. The high byte is on the right.
big-endian. The high byte is on the left.
native-endian. As defined by the native format of the machine. This is the default.

Data Format. Specifies the data representation format of a field. Applies to fields of all data types except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that is neither string nor raw. Choose from:

binary
text (the default)

A setting of binary has different meanings when applied to different data types:

For decimals, binary means packed.
For other numerical data types, binary means "not text".
For dates, binary is equivalent to specifying the julian property for the date field.
For time, binary is equivalent to midnight_seconds.
For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.

By default data is formatted as text, as follows:

For the date data type, text specifies that the data to be written contains a text-based date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide).
For the decimal data type: a field represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. The destination string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored.


For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): DataStage assumes that numeric fields are represented as text.
For the time data type: text specifies that the field represents time in the text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).
For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

(See page 2-28 for a description of data types.)

Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

Field width. The number of bytes in a field represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it is a variable length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

If you specify neither field width nor field max width, numeric fields written as text have the following number of bytes as their maximum width:

8-bit signed or unsigned integers: 4 bytes
16-bit signed or unsigned integers: 6 bytes
32-bit signed or unsigned integers: 11 bytes
64-bit signed or unsigned integers: 21 bytes
single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent)
double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)
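These maximums simply allow for the longest possible text form of each type; for example, the 11 bytes allowed for a 32-bit integer cover a sign plus ten digits, as in -2147483648.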

Pad char. Specifies the pad character used when strings or numeric values are written to an external string representation. Enter a character (single-byte for strings, can be multi-byte for ustrings) or choose null or space. The pad character is used when the external string representation is larger than required to hold the written field. In this case, the external string is filled with the pad character to its full length. Space is the default. Applies to string, ustring, and numeric data types and record, subrec, or tagged types if they contain at least one field of this type.

Character set. Specifies the character set. Choose from ASCII or EBCDIC. The default is ASCII. Applies to all data types except raw and ustring and record, subrec, or tagged containing no fields other than raw or ustring.

String

These properties are applied to columns with a string data type, unless overridden at column level.

Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters. Applies to fields of the string data type and record, subrec, or tagged fields if they contain at least one field of this type.

Import ASCII as EBCDIC. Not relevant for input links.

For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see DataStage Developer's Help.

Decimal

These properties are applied to columns with a decimal data type unless overridden at column level.

Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No. The default is No.

Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default).

Packed. Select an option to specify what the decimal columns contain. Choose from:

Yes to specify that the decimal columns contain data in packed decimal format (the default). This has the following subproperties:

Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the column's actual sign value.


No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the following sub-property:

Sign Position. Choose leading or trailing as appropriate.

No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This has the following subproperty:

Sign Position. Choose leading or trailing as appropriate.

No (overpunch) to specify that the field has a leading or end byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This has the following subproperty:

Sign Position. Choose leading or trailing as appropriate.

Precision. Specifies the precision where a decimal column is written in text format. Enter a number. When a decimal is written to a string representation, DataStage uses the precision and scale defined for the source decimal field to determine the length of the destination string. The precision and scale properties override this default. When they are defined, DataStage truncates or pads the source decimal to fit the size of the destination string. If you have also specified the field width property, DataStage truncates or pads the source decimal to fit the size specified by field width.

Rounding. Specifies how to round a decimal column when writing it. Choose from:

up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE 754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1.
down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE 754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.
nearest value. Round the source column towards the nearest representable value. This mode corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4 becomes -1, -1.5 becomes -2.
truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1.


Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination. By default, when DataStage writes a source decimal to a string representation, it uses the precision and scale defined for the source decimal field to determine the length of the destination string. You can override the default by means of the precision and scale properties. When you do, DataStage truncates or pads the source decimal to fit the size of the destination string. If you have also specified the field width property, DataStage truncates or pads the source decimal to fit the size specified by field width.

Numeric

These properties apply to integer and float fields unless overridden at column level.

C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf(). For example, specifying a C-format of %x and a field width of 8 ensures that integers are written as 8-byte hexadecimal strings.

In_format. This property is not relevant for input links.

Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf(). By default, DataStage invokes the C sprintf() function to convert a numeric field formatted as either integer or floating point data to a string. If this function does not output data in a satisfactory format, you can specify the out_format property to pass formatting arguments to sprintf().

Date

These properties are applied to columns with a date data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.

Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

Format string. The string format of a date. By default this is %yyyy-%mm-%dd. The Format string can contain one or a combination of the following elements:

%dd: A two-digit day.
%mm: A two-digit month.
%year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff, for example %1970yy.
%yy: A two-digit year derived from a year cutoff of 1900.
%yyyy: A four-digit year.
%ddd: Day of year in three-digit form (range of 1 - 366).
%mmm: Three-character month abbreviation.

The format_string is subject to the following restrictions:


It cannot have more than one element of the same type, for example it cannot contain two %dd elements.
It cannot have both %dd and %ddd.
It cannot have both %yy and %yyyy.
It cannot have both %mm and %ddd.
It cannot have both %mmm and %ddd.
It cannot have both %mm and %mmm.
If it has %dd, it must have %mm or %mmm.
It must have exactly one of %yy or %yyyy.

When you specify a date format string, prefix each component with the percent symbol (%). Separate the string's components with any character except the percent sign (%).

If this format string does not include a day, it is set to the first of the month in the destination field. If the format string does not include the month and day, they default to January 1. Note that the format string must contain a month if it also contains a day; that is, you cannot omit only the month.

The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can also set this using the environment variable APT_DATE_CENTURY_BREAK_YEAR (see "APT_DATE_CENTURY_BREAK_YEAR" in Parallel Job Advanced Developer's Guide), but this is overridden by %year_cutoffyy if you have set it. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029.
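For example (the date shown is purely illustrative), a format string of %dd/%mm/%yyyy writes 23 September 2004 as 23/09/2004, while %yyyy-%ddd writes the same date as 2004-267; both strings satisfy the restrictions listed above.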


Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.

Time

These properties are applied to columns with a time data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.

Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss. The possible components of the time format string are:

%hh: A two-digit hours component.
%nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date).
%ss: A two-digit seconds component.
%ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.
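For example (the time value is illustrative), a format string of %hh:%nn:%ss.2 writes half past two in the afternoon as 14:30:00.00, the two fractional digits always being present because trailing zeros are not suppressed.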

You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%).

Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp

These properties are applied to columns with a timestamp data type unless overridden at column level.

Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss. Specify the format as follows:

For the date:

%dd: A two-digit day.
%mm: A two-digit month.
%year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff.
%yy: A two-digit year derived from a year cutoff of 1900.
%yyyy: A four-digit year.
%ddd: Day of year in three-digit form (range of 1 - 366).

For the time:


%hh: A two-digit hours component.
%nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date).
%ss: A two-digit seconds component.
%ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol (%). Separate the string's components with any character except the percent sign (%).

Outputs Page
The Outputs page allows you to specify details about how the Sequential File stage reads data from one or more flat files. The Sequential File stage can have only one output link, but this can read from multiple files.

It can also have a single reject link. This is typically used when you are writing to a file and provides a location where records that have failed to be written to a file for some reason can be sent. When you are reading files, you can use a reject link as a destination for rows that do not match the expected column definitions.

The Output name drop-down list allows you to choose whether you are looking at details of the main output link (the stream link) or the reject link.

The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Formats tab gives information about the format of the files being read. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link.

Details about Sequential File stage properties and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Output Link Properties Tab


The Properties tab allows you to specify properties for the output link. These dictate how data is read, and from which files. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Source/File | pathname | N/A | Y if Read Method = Specific File(s) | Y | N/A
Source/File Pattern | pathname | N/A | Y if Read Method = File Pattern | N | N/A
Source/Read Method | Specific File(s)/File Pattern | Specific File(s) | Y | N | N/A
Options/Missing File Mode | Error/OK/Depends | Depends | Y if File used | N | N/A
Options/Keep file Partitions | True/False | False | Y | N | N/A
Options/Reject Mode | Continue/Fail/Save | Continue | Y | N | N/A
Options/Report Progress | Yes/No | Yes | Y | N | N/A
Options/Filter | command | N/A | N | N | N/A
Options/File Name Column | column name | fileNameColumn | N | N | N/A
Options/Number Of Readers Per Node | number | 1 | N | N | N/A
Options/Schema File | pathname | N/A | N | N | N/A


Source Category
File

This property defines the flat file that data will be read from. You can type in a pathname, or browse for a file. You can specify multiple files by repeating the File property. Do this by selecting the Properties item at the top of the tree, and clicking on File in the Available properties to add window. Do this for each extra file you want to specify.

File Pattern

Specifies a group of files to import. Specify a file containing a list of files or a job parameter representing the file. The file could also contain any valid shell expression, in Bourne shell syntax, that generates a list of file names.

Read Method

This property specifies whether you are reading from a specific file or files or using a file pattern to select files (e.g., *.txt).
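For example (the pathname is purely illustrative), the file named by File Pattern might contain a Bourne shell expression such as:

    /data/extracts/sales_2004*.txt

which is expanded to the list of matching files when the job runs.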

Options Category
Missing File Mode

Specifies the action to take if one of your File properties has specified a file that does not exist. Choose from Error to stop the job, OK to skip the file, or Depends, which means the default is Error, unless the file has a node name prefix of *: in which case it is OK. The default is Depends.

Keep file Partitions

Set this to True to partition the imported data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.

Reject Mode

Allows you to specify behavior if a read record does not match the expected schema. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.


Report Progress

Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.

Filter

This is an optional property. You can use this to specify that the data is passed through a filter program after being read from the files. Specify the filter command, and any required arguments, in the Property Value box.

File Name Column

This is an optional property. It adds an extra column of type VarChar to the output of the stage, containing the pathname of the file the record is read from. You should also add this column manually to the Columns definitions to ensure that the column is not dropped if you are not using runtime column propagation, or it is turned off at some point.

Number Of Readers Per Node

This is an optional property and only applies to files containing fixed-length records; it is mutually exclusive with the Read from multiple nodes property. Specifies the number of instances of the file read operator on a processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file. The starting record location in the file for each operator, or seek location, is determined by the data file size, the record length, and the number of instances of the operator, as specified by numReaders. The resulting data set contains one partition per instance of the file read operator, as determined by numReaders. This provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node, and written to separate partitions. This method can result in better I/O performance on an SMP system.
[Diagram: a single file read by four reader instances on one node (Number of readers per node = 4), producing a partitioned data set]
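As a rough sketch of how the division works (the figures are illustrative): reading a file of 1,000,000 fixed-length 100-byte records with Number Of Readers Per Node set to 4 on a single node gives each reader instance a contiguous range of about 250,000 records, with seek locations at approximately byte offsets 0, 25,000,000, 50,000,000, and 75,000,000, and the resulting data set has four partitions.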

Read From Multiple Nodes

This is an optional property and only applies to files containing fixed-length records; it is mutually exclusive with the Number of Readers Per Node property. Set this to Yes to allow individual files to be read by several nodes. This can improve performance on a cluster system. DataStage knows the number of nodes available, and using the fixed length record size, and the actual size of the file to be read, allocates the reader on each node a separate region within the file to process. The regions will be of roughly equal size.
[Diagram: a single file read by reader instances on several nodes (Read from multiple nodes = Yes), producing a partitioned data set]

Schema File

This is an optional property. By default the Sequential File stage will use the column definitions defined on the Columns and Format tabs as a schema for reading the file. You can, however, specify a file containing a schema instead (note, however, that if you have defined columns on the Columns tab, you should ensure these match the schema file). Type in a pathname or browse for a schema file.
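A schema file is a plain text file containing a record schema. As a minimal sketch of what such a file might look like (the column names, types, and format properties shown are hypothetical; see Appendix A for the schema syntax):

    record {final_delim=end, delim=',', quote=double}
    (
      CustomerID: int32;
      CustomerName: string[max=30];
      Balance: nullable decimal[8,2];
    )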

Reject Links
You cannot change the properties of a Reject link. The Properties tab for a reject link is blank. Similarly, you cannot edit the column definitions for a reject link. For writing files, the link uses the column definitions for the input link. For reading files, the link uses a single column called rejected containing raw data for columns rejected after reading because they do not match the schema.

Output Link Format Tab


The Format tab allows you to supply information about the format of the flat file or files which you are reading. The tab has a similar format to the Properties tab and is described on page 3-44.

If you do not alter any of the Format settings, the Sequential File stage will expect to read a file of the following format:

The file comprises variable length columns contained within double quotes.
All columns are delimited by a comma, except for the final column in a row.
Rows are delimited by a UNIX newline.

You can use the Defaults button to change your default settings. Use the Format tab to specify your required settings, then click Defaults > Save current as default. All your sequential files will use your settings by default from now on. If your requirements change, you can choose Defaults > Reset defaults from factory settings to go back to the original defaults as described above. Once you have done this, you then have to click Defaults > Set current from default for the new defaults to take effect.

You can use the Format As item from the shortcut menu in the Format Tab to quickly change to a fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file.

Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.
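For example, with the default settings the stage expects each record in the file to look something like this (the values shown are illustrative):

    "1001","Smith","123.45"
    "1002","Jones","678.90"

that is, every column enclosed in double quotes, columns separated by commas, and each row ended by a UNIX newline.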


Any property that you set on this tab can be overridden at the column level by setting properties for individual columns on the Edit Column Metadata dialog box (see page 3-26).

This description uses the terms record and row, and field and column, interchangeably.

The following sections list the property types and properties available for each type.

Record level

These properties define details about how data records are formatted in the flat file. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are:

Fill char. Does not apply to output links.

Final delimiter string. Specify the string written after the last column of a record in place of the column delimiter. Enter one or more characters; this precedes the record delimiter if one is used. Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to comma (see under "Field Defaults" for Delimiter) and Final delimiter string to ", " (comma space - you do not need to enter the inverted commas), all fields are delimited by a comma, except the final field, which is delimited by a comma followed by an ASCII space character. DataStage skips the specified delimiter string when reading the file.

Final delimiter. Specify the single character written after the last column of a record in place of the field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. DataStage skips the specified delimiter string when reading the file. See the following diagram for an illustration.

whitespace. The last column of each record will not include any trailing white spaces found at the end of the record.
end. The last column of each record does not include the field delimiter. This is the default setting.
none. The last column of each record does not have a delimiter; used for fixed-width fields.
null. The last column of each record is delimited by the ASCII null character.
comma. The last column of each record is delimited by the ASCII comma character.
tab. The last column of each record is delimited by the ASCII tab character.
[Diagram: example records showing the record delimiter and field delimiters, comparing Final Delimiter = end with Final Delimiter = comma]

Intact. The intact property specifies an identifier of a partial schema. A partial schema specifies that only the column(s) named in the schema can be modified by the stage. All other columns in the row are passed through unmodified. (See "Partial Schemas" in Appendix A for details.) The file containing the partial schema is specified in the Schema File property on the Outputs tab. This property has a dependent property:

Check intact. Select this to force validation of the partial schema as the file or files are imported. Note that this can degrade performance.

Record delimiter string. Specify the string at the end of each record. Enter one or more characters. This is mutually exclusive with Record delimiter, which is the default, and with Record type and Record prefix.

Record delimiter. Specify the single character at the end of each record. Type a character or select one of the following:

UNIX Newline (the default)
null

(To specify a DOS newline, use the Record delimiter string property set to \R\N or choose Format as DOS line terminator from the shortcut menu.)

Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record type.

Record length. Select Fixed where fixed length fields are being read. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes. This is not used by default (by default, files are comma-delimited).

Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by default. This is mutually exclusive with Record delimiter, which is the default, and with Record delimiter string and Record type.


Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or VR. This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix, and by default is not used.

Field Defaults

Defines default properties for columns read from the file or files. These are applied to all columns, but can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are:

Actual field length. Specifies the actual number of bytes to skip if the field's length equals the setting of the null field length property.

Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select one of whitespace, end, none, null, comma, or tab. DataStage skips the delimiter when reading.

whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of the column.
end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the same as a setting of None which is used for fields with fixed-width columns.
none. No delimiter (used for fixed-width).
null. ASCII Null character is used.
comma. ASCII comma character is used.
tab. ASCII tab character is used.

Delimiter string. Specify the string at the end of each field. Enter one or more characters. This is mutually exclusive with Delimiter, which is the default. For example, specifying ", " (comma space - you do not need to enter the inverted commas) specifies that each field is delimited by ", " unless overridden for individual fields. DataStage skips the delimiter string when reading.


Null field length. The length in bytes of a variable-length field that contains a null. When a variable-length field is read, a length of null field length in the source field indicates that it contains a null. This property is mutually exclusive with Null field value.

Null field value. Specifies the value given to a null field if the source is set to null. Can be a number, string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values. This property is mutually exclusive with Null field length and Actual length. For a fixed width data representation, you can use Pad char (from the general section of Type defaults) to specify a repeated trailing character if the value you specify is shorter than the fixed width of the field.

Prefix bytes. You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. DataStage reads the length prefix but does not include the prefix as a separate field in the data set it reads from the file. This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which are used by default.

Print field. This property is intended for use when debugging jobs. Set it to have DataStage produce a message for every field it reads. The message has the format:
Importing N: D

where:

N is the field name.
D is the imported data of the field. Non-printable characters contained in D are prefixed with an escape character and written as C string literals; if the field contains binary data, it is output in octal format.
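For example (the field name and value are illustrative), a job reading a name column with Print field set might log a message such as:

    Importing CustomerName: John Smith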

Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another character or pair of characters. Choose Single or Double, or enter a character. This is set to double quotes by default. When reading, DataStage ignores the leading quote character and reads all bytes up to but not including the trailing quote character.

Vector prefix. For fields that are variable length vectors, specifies that a 1-, 2-, or 4-byte prefix contains the number of elements in the vector. You can override this default prefix for individual vectors.


Variable-length vectors must use either a prefix on the vector or a link to another field in order to specify the number of elements in the vector. If the variable length vector has a prefix, you use this property to indicate the prefix length. DataStage reads the length prefix but does not include it as a separate field in the data set. By default, the prefix length is assumed to be one byte.

Type Defaults

These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General

These properties apply to several data types (unless overridden at column level):

Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

little-endian. The high byte is on the right.
big-endian. The high byte is on the left.
native-endian. As defined by the native format of the machine. This is the default.

Data Format. Specifies the data representation format of a field. Applies to fields of all data types except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that is neither string nor raw. Choose from:

binary
text (the default)

A setting of binary has different meanings when applied to different data types:

For decimals, binary means packed.
For other numerical data types, binary means "not text".
For dates, binary is equivalent to specifying the julian property for the date field.
For time, binary is equivalent to midnight_seconds.
For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.


By default data is formatted as text, as follows:

For the date data type, text specifies that the data read contains a text-based date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide).
For the decimal data type: a field represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. The destination string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored.
For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): DataStage assumes that numeric fields are represented as text.
For the time data type: text specifies that the field represents time in the text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).
For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

(See page 2-28 for a description of data types.)

Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

Field width. The number of bytes in a field represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it is a variable length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

If you specify neither field width nor field max width, numeric fields written as text have the following number of bytes as their maximum width:

8-bit signed or unsigned integers: 4 bytes
16-bit signed or unsigned integers: 6 bytes
32-bit signed or unsigned integers: 11 bytes
64-bit signed or unsigned integers: 21 bytes
single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent)
double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)

Pad char. This property is ignored for output links.

Character set. Specifies the character set. Choose from ASCII or EBCDIC. The default is ASCII. Applies to all data types except raw and ustring and record, subrec, or tagged containing no fields other than raw or ustring.

String

These properties are applied to columns with a string data type, unless overridden at column level.

Export EBCDIC as ASCII. Not relevant for output links.

Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters.

For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see DataStage Developer's Help.

Decimal

These properties are applied to columns with a decimal data type unless overridden at column level.

Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No. The default is No.

Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default).

Packed. Select an option to specify what the decimal columns contain. Choose from:

Yes to specify that the decimal fields contain data in packed decimal format (the default). This has the following sub-properties:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when reading decimal fields. Select No to write a positive sign (0xf) regardless of the field's actual sign value.


No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the following sub-property: Sign Position. Choose leading or trailing as appropriate.

No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This has the following subproperty: Sign Position. Choose leading or trailing as appropriate.

No (overpunch) to specify that the field has a leading or end byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This has the following subproperty: Sign Position. Choose leading or trailing as appropriate.

Precision. Specifies the precision of a packed decimal. Enter a number.

Rounding. Specifies how to round the source field to fit into the destination decimal when reading a source field to a decimal. Choose from:

up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE 754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1.
down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE 754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.
nearest value. Round the source column towards the nearest representable value. This mode corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4 becomes -1, -1.5 becomes -2.
truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1.

Scale. Specifies the scale of a source packed decimal.

Numeric

These properties apply to integer and float fields unless overridden at column level.


C_format. Perform non-default conversion of data from string data to an integer or floating-point type. This property specifies a C-language format string used for reading integer or floating point strings. This is passed to sscanf(). For example, specifying a C-format of %x and a field width of 8 ensures that a 32-bit integer is formatted as an 8-byte hexadecimal string.

In_format. Format string used for conversion of data from string to integer or floating-point data. This is passed to sscanf(). By default, DataStage invokes the C sscanf() function to convert a numeric field formatted as a string to either integer or floating point data. If this function does not output data in a satisfactory format, you can specify the in_format property to pass formatting arguments to sscanf().

Out_format. This property is not relevant for output links.

Date

These properties are applied to columns with a date data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.

Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

Format string. The string format of a date. By default this is %yyyy-%mm-%dd. The Format string can contain one or a combination of the following elements:

%dd: A two-digit day.
%mm: A two-digit month.
%year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff, for example %1970yy.
%yy: A two-digit year derived from a year cutoff of 1900.
%yyyy: A four-digit year.
%ddd: Day of year in three-digit form (range of 1 - 366).
%mmm: Three-character month abbreviation.

The format_string is subject to the following restrictions:


It cannot have more than one element of the same type, for example it cannot contain two %dd elements.
It cannot have both %dd and %ddd.
It cannot have both %yy and %yyyy.
It cannot have both %mm and %ddd.
It cannot have both %mmm and %ddd.
It cannot have both %mm and %mmm.
If it has %dd, it must have %mm or %mmm.
It must have exactly one of %yy or %yyyy.

When you specify a date format string, prefix each component with the percent symbol (%). Separate the string's components with any character except the percent sign (%).

If this format string does not include a day, it is set to the first of the month in the destination field. If the format string does not include the month and day, they default to January 1. Note that the format string must contain a month if it also contains a day; that is, you cannot omit only the month.

The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can also set this using the environment variable APT_DATE_CENTURY_BREAK_YEAR (see "APT_DATE_CENTURY_BREAK_YEAR" in Parallel Job Advanced Developer's Guide), but this is overridden by %year_cutoffyy if you have set it. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029.

Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.

Time

These properties are applied to columns with a time data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.

Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss. The possible components of the time format string are:

%hh: A two-digit hours component.
%nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date).
%ss: A two-digit seconds component.
%ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%).

Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp

These properties are applied to columns with a timestamp data type unless overridden at column level.

Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss. Specify the format as follows:

For the date:

%dd: A two-digit day.
%mm: A two-digit month.
%year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff.
%yy: A two-digit year derived from a year cutoff of 1900.
%yyyy: A four-digit year.
%ddd: Day of year in three-digit form (range of 1 - 366).

For the time:


%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol (%). Separate the string's components with any character except the percent sign (%).
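As an illustration, the following shows how some of these format strings might render a value. The date and time values here are invented for this sketch and are not taken from any particular job:

    Format string                   Example value written
    %yyyy-%mm-%dd                   1997-12-25
    %dd/%mm/%1970yy                 25/12/97   (two-digit year, 1970 cutoff)
    %hh:%nn:%ss.2                   23:59:07.50
    %yyyy-%mm-%dd %hh:%nn:%ss       1997-12-25 23:59:07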

Using RCP With Sequential Stages


Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. So such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.
Sequential files, unlike most other data sources, do not have inherent column definitions, and so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property (see "Schema File" on page 5-10 and on page 5-29) to specify a schema which describes all the columns in the sequential file (an illustrative schema appears after the list below). You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:
Sequential File
File Set
External Source
External Target
Column Import
Column Export
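For example, a schema file used with the Schema File property might contain something like the following. This is a minimal sketch only; the column names, types, and record-level format settings are invented for illustration, and the full schema syntax is described in Appendix A:

    record
      {record_delim='\n', delim=',', quote=double}
      (
        OrderID: int32;
        CustomerName: string[max=30];
        OrderValue: decimal[8,2];
      )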

6
File Set Stage
The File Set stage is a file stage. It allows you to read data from or write data to a file set. The stage can have a single input link, a single output link, and a single rejects link. It only executes in parallel mode.
What is a file set? DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.
The amount of data that can be stored in each destination data file is limited by the characteristics of the file system and the amount of free disk space available. The number of files created by a file set depends on:
The number of processing nodes in the default node pool
The number of disks in the export or default disk pool connected to each processing node in the default node pool
The size of the partitions of the data set

The File Set stage enables you to create and write to file sets, and to read data back from file sets.

Unlike data sets, file sets carry formatting information that describes the format of the files to be read or written. When you edit a File Set stage, the File Set stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors." The stage editor has up to three pages, depending on whether you are reading or writing a file set:
Stage Page. This is always present and is used to specify general information about the stage.
Inputs Page. This is present when you are writing to a file set. This is where you specify details about the file set being written to.
Outputs Page. This is present when you are reading from a file set. This is where you specify details about the file set being read from.
There are one or two special points to note about using runtime column propagation (RCP) with File Set stages. See "Using RCP With File Set Stages" on page 6-37 for details.

Must Do's
DataStage has many defaults, which means that it can be very easy to include File Set stages in a job. This section specifies the minimum steps to take to get a File Set stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic methods. You will learn where the shortcuts are when you get familiar with the product.


The steps required depend on whether you are using the File Set stage to read or write a file.

Writing to a File
In the Input Link Properties Tab specify the pathname of the file set being written to. The other properties all have default values, which you can change or not as required.
In the Input Link Format Tab specify format details for the file set you are writing to, or accept the defaults (variable length columns enclosed in double quotes and delimited by commas, rows delimited with UNIX newlines).
Ensure column meta data has been specified for the file set.

Reading from a File


In the Output Link Properties Tab specify the pathname of the file set being read from. The other properties all have default values, which you can change or not as required.
In the Output Link Format Tab specify format details for the file set you are reading from, or accept the defaults (variable length columns enclosed in double quotes and delimited by commas, rows delimited with UNIX newlines).
Ensure column meta data has been specified for the file set (this can be achieved via a schema file if required).

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage.

Advanced Tab
This tab allows you to specify the following:
Execution Mode. This is set to parallel and cannot be changed.
Combineability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.


Preserve partitioning. You can select Set or Clear. If you select Set, file set read operations will request that the next stage preserves the partitioning as is (it is ignored for file set write operations).
Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

NLS Map Tab


The NLS Map tab allows you to define a character set map for the File Set stage. This overrides the default character set map set for the project or the job. You can specify that the map be supplied as a job parameter if required. You can also select Allow per-column mapping. This allows character set maps to be specified for individual columns within the data processed by the File Set stage. An extra property, NLS Map, appears in the Columns grid in the Columns tab, but note that only ustring data types allow you to set an NLS map value (see "Data Types" on page 2-28).


Inputs Page
The Inputs page allows you to specify details about how the File Set stage writes data to a file set. The File Set stage can have only one input link. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the file set. The Formats tab gives information about the format of the files being written. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the input link. Details about File Set stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what file set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property                  Values                                      Default          Mandatory?  Repeats?  Dependent of
Target/File Set                    pathname                                    N/A              Y           N         N/A
Target/File Set Update Policy      Create (Error if exists)/Overwrite/         Overwrite        Y           N         N/A
                                   Use Existing (Discard records)/
                                   Use Existing (Discard schema & records)
Target/File Set Schema policy      Write/Omit                                  Write            Y           N         N/A
Options/Cleanup on Failure         True/False                                  True             Y           N         N/A
Options/Single File Per Partition  True/False                                  False            Y           N         N/A
Options/Reject Mode                Continue/Fail/Save                          Continue         Y           N         N/A
Options/Diskpool                   string                                      N/A              N           N         N/A
Options/File Prefix                string                                      export.username  N           N         N/A
Options/File Suffix                string                                      none             N           N         N/A
Options/Maximum File Size          number MB                                   N/A              N           N         N/A
Options/Schema File                pathname                                    N/A              N           N         N/A
Target Category
File Set
This property defines the file set that the incoming data will be written to. You can type in a pathname of, or browse for, a file set descriptor file (by convention ending in .fs).
File Set Update Policy
Specifies what action will be taken if the file set you are writing to already exists. Choose from:
Create (Error if exists)
Overwrite
Use Existing (Discard records)
Use Existing (Discard schema & records)
The default is Overwrite.
File Set Schema policy
Specifies whether the schema should be written to the file set. Choose from Write or Omit. The default is Write.


Options Category
Cleanup on Failure
This is set to True by default and specifies that the stage will delete any partially written files if the stage fails for any reason. Set this to False to specify that partially written files should be left.
Single File Per Partition
Set this to True to specify that one file is written for each partition. The default is False.
Reject Mode
Allows you to specify behavior if a record fails to be written for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease writing if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
Diskpool
This is an optional property. Specify the name of the disk pool into which to write the file set. You can also specify a job parameter.
File Prefix
This is an optional property. Specify a prefix for the name of the file set components. If you do not specify a prefix, the system writes the following: export.username, where username is your login. You can also specify a job parameter.
File Suffix
This is an optional property. Specify a suffix for the name of the file set components. The suffix is omitted by default.
Maximum File Size
This is an optional property. Specify the maximum file size in MB. The value must be equal to or greater than 1.
Schema File
This is an optional property. By default the File Set stage will use the column definitions defined on the Columns tab and formatting information from the Format tab as a schema for writing the file. You can, however, specify a file containing a schema instead (note, however, that if you have defined columns on the Columns tab, you


should ensure these match the schema file). Type in a pathname or browse for a schema file.
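As an illustration only, a schema file matching two columns defined on the Columns tab might contain something like the following. The column names are invented for this sketch; see Appendix A for the schema syntax:

    record
      (
        CustomerID: int32;
        CustomerName: string[max=30];
      )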

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the file set. It also allows you to specify that the data should be sorted before being written.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
Whether the File Set stage is set to execute in parallel or sequential mode.
Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the File Set stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the File Set stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the File Set stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.


Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
(Auto). This is the default method for the File Set stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the file or files. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the Auto methods).
Select the check boxes as follows:
Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.


Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Input Link Format Tab


The Format tab allows you to supply information about the format of the files in the file set to which you are writing. The tab has a similar format to the Properties tab and is described on page 3-25.
If you do not alter any of the Format settings, the File Set stage will produce files of the following format:
Files comprise variable length columns contained within double quotes.
All columns are delimited by a comma, except for the final column in a row.
Rows are delimited by a UNIX newline.
You can use the Format As item from the shortcut menu in the Format Tab to quickly change to a fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file.
To change individual properties, select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Popup help for each of the available properties appears if you hover the mouse pointer over it.
Any property that you set on this tab can be overridden at the column level by setting properties for individual columns on the Edit Column Metadata dialog box (see page 3-26).
This description uses the terms record and row and field and column interchangeably. The following sections list the Property types and properties available for each type.
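To make the default format described above concrete, a record written with the default settings would look something like the following single line (the field values are invented for illustration):

    "1001","Smith","11 Elm Street","24.99"

Each value is enclosed in double quotes, fields are separated by commas, there is no delimiter after the final field, and the row ends with a UNIX newline.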


Record level
These properties define details about how data records are formatted in the flat file. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are:
Fill char. Specify an ASCII character or a value in the range 0 to 255. You can also choose Space or Null from a drop-down list. This character is used to fill any gaps in a written record caused by column positioning properties. Set to 0 by default (which is the NULL character). For example, to set it to space you could also type in the space character or enter 32. Note that this value is restricted to one byte, so you cannot specify a multi-byte Unicode character.
Final delimiter string. Specify a string to be written after the last column of a record in place of the column delimiter. Enter one or more characters; this precedes the record delimiter if one is used. Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to comma (see under "Field Defaults" for Delimiter) and Final delimiter string to ", " (comma space; you do not need to enter the inverted commas) all fields are delimited by a comma, except the final field, which is delimited by a comma followed by an ASCII space character.
Final delimiter. Specify a single character to be written after the last column of a record in place of the field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. See the following diagram for an illustration.

whitespace. The last column of each record will not include any trailing white spaces found at the end of the record. end. The last column of each record does not include the field delimiter. This is the default setting. none. The last column of each record does not have a delimiter; used for fixed-width fields. null. The last column of each record is delimited by the ASCII null character. comma. The last column of each record is delimited by the ASCII comma character.


tab. The last column of each record is delimited by the ASCII tab character.
[Diagram: example records showing the field delimiter between columns and the record delimiter (nl) at the end of the row, for Final Delimiter = end and Final Delimiter = comma.]

When writing, a space is now inserted after every field except the last in the record. Previously, a space was inserted after every field including the last. (If you want to revert to the pre-release 7.5 behavior of inserting a space after the last field, set the APT_FINAL_DELIM_COMPATIBLE environment variable.)
Intact. The intact property specifies an identifier of a partial schema. A partial schema specifies that only the column(s) named in the schema can be modified by the stage. All other columns in the row are passed through unmodified. (See "Partial Schemas" in Appendix A for details.) The file containing the partial schema is specified in the Schema File property on the Properties tab (see page 5-9). This property has a dependent property, Check intact, but this is not relevant to input links.
Record delimiter string. Specify a string to be written at the end of each record. Enter one or more characters. This is mutually exclusive with Record delimiter, which is the default, and with record prefix and record type.
Record delimiter. Specify a single character to be written at the end of each record. Type a character or select one of the following:

UNIX Newline (the default) null

(To implement a DOS newline, use the Record delimiter string property set to \R\N or choose Format as DOS line terminator from the shortcut menu.) Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record type.
Record length. Select Fixed where fixed length fields are being written. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes. This is not used by default (default files are comma-delimited). The record is padded to the specified length with either zeros or the fill character if one has been specified.


Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by default. This is mutually exclusive with Record delimiter, which is the default, and record delimiter string and record type. Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or VR. This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix and by default is not used. Field Defaults Defines default properties for columns written to the file or files. These are applied to all columns written, but can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are: Actual field length. Specifies the number of bytes to fill with the Fill character when a field is identified as null. When DataStage identifies a null field, it will write a field of this length full of Fill characters. This is mutually exclusive with Null field value. Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select one of whitespace, end, none, null, comma, or tab.

whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of the column. end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the same as a setting of None which is used for fields with fixed-width columns. none. No delimiter (used for fixed-width). null. ASCII Null character is used. comma. ASCII comma character is used. tab. ASCII tab character is used.

Delimiter string. Specify a string to be written at the end of each field. Enter one or more characters. This is mutually exclusive with Delimiter, which is the default. For example, specifying , (comma


space; you do not need to enter the inverted commas) would have each field delimited by ", " unless overridden for individual fields.
Null field length. The length in bytes of a variable-length field that contains a null. When a variable-length field is written, DataStage writes a length value of null field length if the field contains a null. This property is mutually exclusive with null field value.
Null field value. Specifies the value written to null field if the source is set to null. Can be a number, string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values. This property is mutually exclusive with Null field length and Actual length. For a fixed width data representation, you can use Pad char (from the general section of Type defaults) to specify a repeated trailing character if the value you specify is shorter than the fixed width of the field.
Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column's length or the tag value for a tagged field. You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. DataStage inserts the prefix before each field. This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which are used by default.
Print field. This property is not relevant for input links.
Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another character or pair of characters. Choose Single or Double, or enter a character. This is set to double quotes by default. When writing, DataStage inserts the leading quote character, the data, and a trailing quote character. Quote characters are not counted as part of a field's length.
Vector prefix. For fields that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector. You can override this default prefix for individual vectors. Variable-length vectors must use either a prefix on the vector or a link to another field in order to specify the number of elements in


the vector. If the variable length vector has a prefix, you use this property to indicate the prefix length. DataStage inserts the element count as a prefix of each variable-length vector field. By default, the prefix length is assumed to be one byte. Type Defaults These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type. General These properties apply to several data types (unless overridden at column level): Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

little-endian. The high byte is on the right. big-endian. The high byte is on the left. native-endian. As defined by the native format of the machine. This is the default.

Data Format. Specifies the data representation format of a field. Applies to fields of all data types except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that is neither string nor raw. Choose from:

binary text (the default)

A setting of binary has different meanings when applied to different data types:

For decimals, binary means packed. For other numerical data types, binary means "not text". For dates, binary is equivalent to specifying the julian property for the date field. For time, binary is equivalent to midnight_seconds. For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.

By default data is formatted as text, as follows:


For the date data type, text specifies that the data to be written contains a text-based date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the decimal data type: a field represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. The destination string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored. For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): DataStage assumes that numeric fields are represented as text. For the time data type: text specifies that the field represents time in the text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

(See page 2-28 for a description of data types.)
Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.
Field width. The number of bytes in a field represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it is a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.
If you specify neither field width nor field max width, numeric fields written as text have the following number of bytes as their maximum width:

8-bit signed or unsigned integers: 4 bytes 16-bit signed or unsigned integers: 6 bytes 32-bit signed or unsigned integers: 11 bytes


64-bit signed or unsigned integers: 21 bytes single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent) double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)

Pad char. Specifies the pad character used when strings or numeric values are written to an external string representation. Enter a character (single-byte for strings, can be multi-byte for ustrings) or choose null or space. The pad character is used when the external string representation is larger than required to hold the written field. In this case, the external string is filled with the pad character to its full length. Space is the default. Applies to string, ustring, and numeric data types and record, subrec, or tagged types if they contain at least one field of this type. Character set. Specifies the character set. Choose from ASCII or EBCDIC. The default is ASCII. Applies to all data types except raw and ustring and record, subrec, or tagged containing no fields other than raw or ustring. String These properties are applied to columns with a string data type, unless overridden at column level. Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters. Applies to fields of the string data type and record, subrec, or tagged fields if they contain at least one field of this type. Import ASCII as EBCDIC. Not relevant for input links. For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see DataStage Developers Help. Decimal These properties are applied to columns with a decimal data type unless overridden at column level. Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No. The default is No. Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default). Packed. Select an option to specify what the decimal columns contain, choose from:


Yes to specify that the decimal columns contain data in packed decimal format (the default). This has the following subproperties:

Check. Select Yes to verify that data is packed, or No to not verify. Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the columns actual sign value.

No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the following sub-property:

Sign Position. Choose leading or trailing as appropriate.

No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This has the following subproperty:

Sign Position. Choose leading or trailing as appropriate.

No (overpunch) to specify that the field has a leading or end byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This has the following subproperty:

Sign Position. Choose leading or trailing as appropriate. Precision. Specifies the precision where a decimal column is written in text format. Enter a number. When a decimal is written to a string representation, DataStage uses the precision and scale defined for the source decimal field to determine the length of the destination string. The precision and scale properties override this default. When they are defined, DataStage truncates or pads the source decimal to fit the size of the destination string. If you have also specified the field width property, DataStage truncates or pads the source decimal to fit the size specified by field width. Rounding. Specifies how to round a decimal column when writing it. Choose from:

up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE 754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1. down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE 754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.


nearest value. Round the source column towards the nearest representable value. This mode corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4 becomes -1, -1.5 becomes -2.
truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1.

Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination. By default, when DataStage writes a source decimal to a string representation, it uses the precision and scale defined for the source decimal field to determine the length of the destination string. You can override the default by means of the precision and scale properties. When you do, DataStage truncates or pads the source decimal to fit the size of the destination string. If you have also specified the field width property, DataStage truncates or pads the source decimal to fit the size specified by field width.
Numeric
These properties apply to integer and float fields unless overridden at column level.
C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf(). For example, specifying a C_format of %x and a field width of 8 ensures that integers are written as 8-byte hexadecimal strings.
In_format. This property is not relevant for input links.
Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf(). By default, DataStage invokes the C sprintf() function to convert a numeric field formatted as either integer or floating point data to a string. If this function does not output data in a satisfactory format, you can specify the out_format property to pass formatting arguments to sprintf().
Date
These properties are applied to columns with a date data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.


Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide). Format string. The string format of a date. By default this is %yyyy-%mm-%dd. The Format string can contain one or a combination of the following elements:

%dd: A two-digit day. %mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff, for example %1970yy. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 - 366). %mmm: Three-character month abbreviation.

The format_string is subject to the following restrictions:


It cannot have more than one element of the same type; for example, it cannot contain two %dd elements.
It cannot have both %dd and %ddd.
It cannot have both %yy and %yyyy.
It cannot have both %mm and %ddd.
It cannot have both %mmm and %ddd.
It cannot have both %mm and %mmm.
If it has %dd, it must have %mm or %mmm.
It must have exactly one of %yy or %yyyy.

When you specify a date format string, prefix each component with the percent symbol (%). Separate the string's components with any character except the percent sign (%). If this format string does not include a day, it is set to the first of the month in the destination field. If the format string does not include the month and day, they default to January 1. Note that the format string must contain a month if it also contains a day; that is, you cannot omit only the month.
The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can also set this using the environment variable


APT_DATE_CENTURY_BREAK_YEAR (see "APT_DATE_CENTURY_BREAK_YEAR" in Parallel Job Advanced Developers Guide), but this is overridden by %year_cutoffyy if you have set it. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same as or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029.
Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
Time
These properties are applied to columns with a time data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.
Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss. The possible components of the time format string are:

%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%).
Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.
Timestamp
These properties are applied to columns with a timestamp data type unless overridden at column level.


Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss. Specify the format as follows: For the date:

%dd: A two-digit day. %mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 - 366)

For the time:


%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent sign (%). Separate the string's components with any character except the percent sign (%).
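For example, with the default format string %yyyy-%mm-%dd %hh:%nn:%ss, a timestamp would be written in a form such as 1997-12-25 23:59:07 (the value shown is invented for illustration).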

Outputs Page
The Outputs page allows you to specify details about how the File Set stage reads data from a file set. The File Set stage can have only one output link. It can also have a single reject link, where rows that have failed to be written or read for some reason can be sent. The Output name drop-down list allows you to choose whether you are looking at details of the main output link (the stream link) or the reject link. The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Formats tab gives information about the format of the files being read. The Columns tab specifies the


column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. Details about File Set stage properties and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Output Link Properties Tab


The Properties tab allows you to specify properties for the output link. These dictate how incoming data is read from files in the file set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property                        Values               Default          Mandatory?  Repeats?  Dependent of
Source/File Set                          pathname             N/A              Y           N         N/A
Options/Keep file Partitions             True/False           False            Y           N         N/A
Options/Reject Mode                      Continue/Fail/Save   Continue         Y           N         N/A
Options/Report Progress                  Yes/No               Yes              Y           N         N/A
Options/Filter                           command              N/A              N           N         N/A
Options/Schema File                      pathname             N/A              N           N         N/A
Options/Use Schema Defined in File Set   True/False           False            Y           N         N/A
Options/File Name Column                 column name          fileNameColumn   N           N         N/A
Source Category
File Set
This property defines the file set that the data will be read from. You can type in a pathname of, or browse for, a file set descriptor file (by convention ending in .fs).


Options Category
Keep file Partitions
Set this to True to partition the read data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.
Reject Mode
Allows you to specify behavior for read rows that do not match the expected schema. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
Report Progress
Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.
Filter
This is an optional property. You can use this to specify that the data is passed through a filter program after being read from the files. Specify the filter command, and any required arguments, in the Property Value box.
Schema File
This is an optional property. By default the File Set stage will use the column definitions defined on the Columns and Format tabs as a schema for reading the file. You can, however, specify a file containing a schema instead (note, however, that if you have defined columns on the Columns tab, you should ensure these match the schema file). Type in a pathname or browse for a schema file. This property is mutually exclusive with Use Schema Defined in File Set.
Use Schema Defined in File Set
When you create a file set you have an option to save the schema along with it. When you read the file set you can use this schema in preference to the column definitions by setting this property to True. This property is mutually exclusive with Schema File.


File Name Column
This is an optional property. It adds an extra column of type VarChar to the output of the stage, containing the pathname of the file the record is read from. You should also add this column manually to the Columns definitions to ensure that the column is not dropped if you are not using runtime column propagation, or it is turned off at some point.

Reject Link Properties


You cannot change the properties of a Reject link. The Properties tab for a reject link is blank. Similarly, you cannot edit the column definitions for a reject link. For writing file sets, the link uses the column definitions for the input link. For reading file sets, the link uses a single column called rejected containing raw data for columns rejected after reading because they do not match the schema.

Output Link Format Tab


The Format tab allows you to supply information about the format of the files in the file set which you are reading. The tab has a similar format to the Properties tab and is described on page 3-25.
If you do not alter any of the Format settings, the File Set stage will expect to read files of the following format:
Files comprise variable length columns contained within double quotes.
All columns are delimited by a comma, except for the final column in a row.
Rows are delimited by a UNIX newline.
You can use the Format As item from the shortcut menu in the Format Tab to quickly change to a fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file.
Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.
Any property that you set on this tab can be overridden at the column level by setting properties for individual columns on the Edit Column Metadata dialog box (see page 3-26).


This description uses the terms record and row and field and column interchangeably. The following sections list the Property types and properties available for each type. Record level These properties define details about how data records are formatted in the flat file. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are: Fill char. Does not apply to output links. Final delimiter string. Specify the string written after the last column of a record in place of the column delimiter. Enter one or more characters, this precedes the record delimiter if one is used. Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to comma (see under "Field Defaults" for Delimiter) and Final delimiter string to , (comma space you do not need to enter the inverted commas) all fields are delimited by a comma, except the final field, which is delimited by a comma followed by an ASCII space character. DataStage skips the specified delimiter string when reading the file. Final delimiter. Specify the single character written after the last column of a record in place of the field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. DataStage skips the specified delimiter string when reading the file. See the following diagram for an illustration.

whitespace. The last column of each record will not include any trailing white spaces found at the end of the record. end. The last column of each record does not include the field delimiter. This is the default setting. none. The last column of each record does not have a delimiter, used for fixed-width fields. null. The last column of each record is delimited by the ASCII null character. comma. The last column of each record is delimited by the ASCII comma character.


tab. The last column of each record is delimited by the ASCII tab character.
[Diagram: example records showing the field delimiter between columns and the record delimiter (nl) at the end of the row, for Final Delimiter = end and Final Delimiter = comma.]

Intact. The intact property specifies an identifier of a partial schema. A partial schema specifies that only the column(s) named in the schema can be modified by the stage. All other columns in the row are passed through unmodified. (See "Partial Schemas" in Appendix A for details.) The file containing the partial schema is specified in the Schema File property on the Outputs tab. This property has a dependent property:

Check intact. Select this to force validation of the partial schema as the file or files are imported. Note that this can degrade performance.

Record delimiter string. Specify the string at the end of each record. Enter one or more characters. This is mutually exclusive with Record delimiter, which is the default, and record type and record prefix. Record delimiter. Specify the single character at the end of each record. Type a character or select one of the following:

UNIX Newline (the default) null

(To specify a DOS newline, use the Record delimiter string property set to \R\N or choose Format as DOS line terminator from the shortcut menu.) Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record type.
Record length. Select Fixed where fixed length fields are being read. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes. This is not used by default (default files are comma-delimited).
Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by default. This is mutually exclusive with Record delimiter, which is the default, and with record delimiter string and record type.


Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or VR. This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix and by default is not used. Field Defaults Defines default properties for columns read from the file or files. These are applied to all columns, but can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are: Actual field length. Specifies the actual number of bytes to skip if the fields length equals the setting of the null field length property. Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select one of whitespace, end, none, null, comma, or tab. DataStage skips the delimiter when reading.

whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of the column. end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the same as a setting of None which is used for fields with fixed-width columns. none. No delimiter (used for fixed-width). null. ASCII Null character is used. comma. ASCII comma character is used. tab. ASCII tab character is used.

Delimiter string. Specify the string at the end of each field. Enter one or more characters. This is mutually exclusive with Delimiter, which is the default. For example, specifying , (comma space you do not need to enter the inverted commas) specifies each field is delimited by , unless overridden for individual fields. DataStage skips the delimiter string when reading.


Null field length. The length in bytes of a variable-length field that contains a null. When a variable-length field is read, a length of null field length in the source field indicates that it contains a null. This property is mutually exclusive with null field value.
Null field value. Specifies the value given to a null field if the source is set to null. Can be a number, string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values. This property is mutually exclusive with Null field length and Actual length. For a fixed width data representation, you can use Pad char (from the general section of Type defaults) to specify a repeated trailing character if the value you specify is shorter than the fixed width of the field.
Prefix bytes. You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. DataStage reads the length prefix but does not include the prefix as a separate field in the data set it reads from the file. This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which are used by default.
Print field. This property is intended for use when debugging jobs. Set it to have DataStage produce a message for every field it reads. The message has the format:
Importing N: D

where:

N is the field name. D is the imported data of the field. Non-printable characters contained in D are prefixed with an escape character and written as C string literals; if the field contains binary data, it is output in octal format.

Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another character or pair of characters. Choose Single or Double, or enter a character. This is set to double quotes by default. When reading, DataStage ignores the leading quote character and reads all bytes up to but not including the trailing quote character. Vector prefix. For fields that are variable length vectors, specifies that a 1-, 2-, or 4-byte prefix contains the number of elements in the vector. You can override this default prefix for individual vectors.

Variable-length vectors must use either a prefix on the vector or a link to another field in order to specify the number of elements in the vector. If the variable length vector has a prefix, you use this property to indicate the prefix length. DataStage reads the length prefix but does not include it as a separate field in the data set. By default, the prefix length is assumed to be one byte.

Type Defaults

These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General

These properties apply to several data types (unless overridden at column level):

Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

little-endian. The high byte is on the right. big-endian. The high byte is on the left. native-endian. As defined by the native format of the machine. This is the default.

Data Format. Specifies the data representation format of a field. Applies to fields of all data types except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that is neither string nor raw. Choose from:

binary text (the default)

A setting of binary has different meanings when applied to different data types:

For decimals, binary means packed. For other numerical data types, binary means not text. For dates, binary is equivalent to specifying the julian property for the date field. For time, binary is equivalent to midnight_seconds. For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.
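As an illustration only (this is not DataStage source code, and the values shown are invented), a minimal C sketch of the layout that a binary timestamp implies is:

#include <stdint.h>
#include <stdio.h>

/* Sketch: a binary timestamp is carried as two 32-bit integers, one
   holding the Julian day count for the date portion and one holding the
   number of seconds elapsed since midnight for the time portion. */
typedef struct {
    int32_t julian_day;            /* date portion: Julian day count */
    int32_t seconds_from_midnight; /* time portion: 0 through 86399  */
} binary_timestamp;

int main(void)
{
    binary_timestamp ts = { 2453736, 3661 };   /* hypothetical values */
    printf("day %d, %02d:%02d:%02d\n", ts.julian_day,
           ts.seconds_from_midnight / 3600,
           (ts.seconds_from_midnight % 3600) / 60,
           ts.seconds_from_midnight % 60);
    return 0;
}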

By default data is formatted as text, as follows:

For the date data type, text specifies that the data read contains a text-based date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the decimal data type: a field represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. The destination string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored. For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): DataStage assumes that numeric fields are represented as text. For the time data type: text specifies that the field represents time in the text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

(See page 2-28 for a description of data types.) Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type. Field width. The number of bytes in a field represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it's a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type. If you specify neither field width nor field max width, numeric fields written as text have the following number of bytes as their maximum width:

8-bit signed or unsigned integers: 4 bytes
16-bit signed or unsigned integers: 6 bytes

32-bit signed or unsigned integers: 11 bytes
64-bit signed or unsigned integers: 21 bytes
single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent)
double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)
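As a hedged illustration of where, for example, the 11-byte and 24-byte figures come from, the following C fragment prints the width of the widest 32-bit integer and the widest double-precision float when they are formatted as text:

#include <float.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
    char buf[40];
    /* "-2147483648": a sign plus 10 digits = 11 characters. */
    printf("%d\n", (int)snprintf(buf, sizeof buf, "%d", INT_MIN));
    /* "-1.7976931348623157E+308": sign, digit, decimal point, 16 fraction
       digits, "E", exponent sign, 3 exponent digits = 24 characters. */
    printf("%d\n", (int)snprintf(buf, sizeof buf, "%.16E", -DBL_MAX));
    return 0;
}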

Pad char. This property is ignored for output links.

Character set. Specifies the character set. Choose from ASCII or EBCDIC. The default is ASCII. Applies to all data types except raw and ustring and record, subrec, or tagged containing no fields other than raw or ustring.

String

These properties are applied to columns with a string data type, unless overridden at column level.

Export EBCDIC as ASCII. Not relevant for output links.

Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters. For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see DataStage Developers Help.

Decimal

These properties are applied to columns with a decimal data type unless overridden at column level.

Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No. The default is No.

Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default).

Packed. Select an option to specify what the decimal columns contain; choose from:

Yes to specify that the decimal fields contain data in packed decimal format (the default). This has the following subproperties: Check. Select Yes to verify that data is packed, or No to not verify. Signed. Select Yes to use the existing sign when reading decimal fields. Select No to write a positive sign (0xf) regardless of the field's actual sign value.

No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the following sub-property: Sign Position. Choose leading or trailing as appropriate.

No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This has the following subproperty: Sign Position. Choose leading or trailing as appropriate.

No (overpunch) to specify that the field has a leading or end byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This has the following subproperty: Sign Position. Choose leading or trailing as appropriate.

Precision. Specifies the precision of a packed decimal. Enter a number. Rounding. Specifies how to round the source field to fit into the destination decimal when reading a source field to a decimal. Choose from:

up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE 754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1.

down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE 754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.

nearest value. Round the source column towards the nearest representable value. This mode corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4 becomes -1, -1.5 becomes -2.

truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1.
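The standard C rounding functions happen to behave the same way on these examples, so the following sketch (an illustration only, not DataStage code) can be used to check the four modes:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double v[] = { 1.4, 1.5, 1.6, -1.4, -1.5, -1.6 };
    for (int i = 0; i < 6; i++)
        printf("%5.1f  ceiling=%3.0f  floor=%3.0f  nearest=%3.0f  truncate=%3.0f\n",
               v[i],
               ceil(v[i]),    /* up (ceiling)                        */
               floor(v[i]),   /* down (floor)                        */
               round(v[i]),   /* nearest value (halves away from 0)  */
               trunc(v[i]));  /* truncate towards zero (the default) */
    return 0;
}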

Scale. Specifies the scale of a source packed decimal.

Numeric

These properties apply to integer and float fields unless overridden at column level.

C_format. Perform non-default conversion of data from string data to an integer or floating-point value. This property specifies a C-language format string used for reading integer or floating point strings. This is passed to sscanf(). For example, specifying a C-format of %x and a field width of 8 ensures that a 32-bit integer is formatted as an 8-byte hexadecimal string.

In_format. Format string used for conversion of data from string to integer or floating-point data. This is passed to sscanf(). By default, DataStage invokes the C sscanf() function to convert a numeric field formatted as a string to either integer or floating point data. If this function does not output data in a satisfactory format, you can specify the in_format property to pass formatting arguments to sscanf().

Out_format. This property is not relevant for output links.

Date

These properties are applied to columns with a date data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.

Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

Format string. The string format of a date. By default this is %yyyy-%mm-%dd. The Format string can contain one or a combination of the following elements:

%dd: A two-digit day. %mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff, for example %1970yy. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 - 366). %mmm: Three-character month abbreviation.

The format_string is subject to the following restrictions:


It cannot have more than one element of the same type, for example it cannot contain two %dd elements. It cannot have both %dd and %ddd. It cannot have both %yy and %yyyy. It cannot have both %mm and %ddd.
It cannot have both %mmm and %ddd. It cannot have both %mm and %mmm. If it has %dd, it must have %mm or %mmm. It must have exactly one of %yy or %yyyy.

When you specify a date format string, prefix each component with the percent symbol (%). Separate the string's components with any character except the percent sign (%). If this format string does not include a day, it is set to the first of the month in the destination field. If the format string does not include the month and day, they default to January 1. Note that the format string must contain a month if it also contains a day; that is, you cannot omit only the month. The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can also set this using the environment variable APT_DATE_CENTURY_BREAK_YEAR (see "APT_DATE_CENTURY_BREAK_YEAR" in Parallel Job Advanced Developers Guide), but this is overridden by %year_cutoffyy if you have set it. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029.

Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.

Time

These properties are applied to columns with a time data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.

Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss. The possible components of the time format string are:

%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date).

%ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%).

Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp

These properties are applied to columns with a timestamp data type unless overridden at column level.

Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss. Specify the format as follows: For the date:

%dd: A two-digit day. %mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 - 366).

For the time:


%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol (%). Separate the string's components with any character except the percent sign (%).

Using RCP With File Set Stages


Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. So such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. The File Set stage handles a set of sequential files. Sequential files, unlike most other data sources, do not have inherent column definitions, and so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on File Set stages if you have used the Schema File property (see "Schema File" on page 6-7) to specify a schema which describes all the columns in the sequential files referenced by the stage. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:

Sequential File
File Set
External Source
External Target
Column Import
Column Export
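A minimal sketch of what such a schema file might contain follows; the column names and types here are hypothetical, and a real schema file would describe every column in the files the stage references:

record
(
  OrderID: int32;
  CustomerName: string[max=30];
  OrderDate: date;
  Price: decimal[8,2];
)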

7
Lookup File Set Stage
The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating Lookup file sets, one file will be created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs. When performing lookups, Lookup File Set stages are used in conjunction with Lookup stages. For more information about lookup operations, see Chapter 20, "Merge Stage."

When using a Lookup File Set stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, note that the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup (you could, for example, use a Modify stage to do this). See Chapter 20, "Merge Stage," for more details about performing lookups. When you edit a Lookup File Set stage, the Lookup File Set stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors." The stage editor has up to three pages, depending on whether you are creating or referencing a file set: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is present when you are creating a lookup table. This is where you specify details about the file set being created and written to. Outputs Page. This is present when you are reading from a lookup file set, i.e., where the stage is providing a reference link to a Lookup stage. This is where you specify details about the file set being read from.

Must Dos
DataStage has many defaults, which means that it can be very easy to include Lookup File Set stages in a job. This section specifies the minimum steps to take to get a Lookup File Set stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. The steps required depend on whether you are using the Lookup File Set stage to create a lookup file set, or using it in conjunction with a Lookup stage.

Creating a Lookup File Set:


In the Input Link Properties Tab:

Specify the key that the lookup on this file set will ultimately be performed on. You can repeat this property to specify multiple key columns. You must specify the key when you create the file set; you cannot specify it when performing the lookup. Specify the name of the Lookup File Set. Set Allow Duplicates, or accept the default setting of False.

Ensure column meta data has been specified for the lookup file set.

Looking Up a Lookup File Set:


In the Output Link Properties Tab specify the name of the lookup file set being used in the lookup. Ensure column meta data has been specified for the lookup file set.

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage.

Advanced Tab
This tab only appears when you are using the stage to create a reference file set (i.e., where the stage has an input link). It allows you to specify the following:

Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the contents of the table are processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire contents of the table are processed by the conductor node.

Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.

Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

NLS Map Tab

The NLS Map tab allows you to define a character set map for the Lookup File Set stage. This overrides the default character set map set for the project or the job. You can specify that the map be supplied as a job parameter if required. You can also select Allow per-column mapping. This allows character set maps to be specified for individual columns within the data processed by the Lookup File Set stage. An extra property, NLS Map, appears in the Columns grid in the

Columns tab, but note that only ustring data types allow you to set an NLS map value (see "Data Types" on page 2-28).

Inputs Page
The Inputs page allows you to specify details about how the Lookup File Set stage writes data to a file set. The Lookup File Set stage can have only one input link. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the file set. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Lookup File Set stage properties and partitioning are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written to the file set. Some of the properties are mandatory, although many have default settings.

Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property            Values        Default  Mandatory?  Repeats?  Dependent of
Lookup Keys/Key              Input column  N/A      Y           Y         N/A
Lookup Keys/Case Sensitive   True/False    True     N           N         Key
Target/Lookup File Set       pathname      N/A      Y           N         N/A
Options/Allow Duplicates     True/False    False    Y           N         N/A
Options/Diskpool             string        N/A      N           N         N/A

Lookup Keys Category


Key Specifies the name of a lookup key column. The Key property can be repeated if there are multiple key columns. The property has a dependent property: Case Sensitive. This is a dependent property of Key and specifies whether the parent key is case sensitive or not. Set to true by default.

Target Category
Lookup File Set This property defines the file set that the incoming data will be written to. You can type in a pathname of, or browse for a file set descriptor file (by convention ending in .fs).

Options Category
Allow Duplicates Set this to cause multiple copies of duplicate records to be saved in the lookup table without a warning being issued. Two lookup records are duplicates when all lookup key columns have the same value in the two records. If you do not specify this option, DataStage issues a

warning message when it encounters duplicate records and discards all but the first of the matching records. Diskpool This is an optional property. Specify the name of the disk pool into which to write the file set. You can also specify a job parameter.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the lookup file set. It also allows you to specify that the data should be sorted before being written. By default the stage will write to the file set in entire mode. The complete data set is written to each partition. If the Lookup File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default (auto) collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Lookup File Set stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Lookup File Set stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type dropdown list. This will override any current partitioning. If the Lookup File Set stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following partitioning methods are available: Entire. Each file written to receives the entire data set. This is the default partitioning method for the Lookup File Set stage. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.

Same. Preserves the partitioning already in place.

DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

(Auto). This is the default method for the Lookup Data Set stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.

Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. This is the default method for the Lookup File Set stage.

Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the file or files. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the Auto methods). Select the check boxes as follows:

Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

Stable. Select this if you want to preserve previously sorted data sets. This is the default.

Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
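As a conceptual sketch only (this is not DataStage code, and the hash function is just a stand-in), the difference between the modulus and hash partitioning methods described above can be pictured in C as:

#include <stdint.h>

/* Modulus: the partition number is the integer key value modulo the
   number of partitions. */
static int modulus_partition(uint64_t key, int num_partitions)
{
    return (int)(key % (uint64_t)num_partitions);
}

/* Hash: the key (here a string) is hashed first and the hash is reduced
   modulo the number of partitions, so non-integer keys can also be spread. */
static int hash_partition(const char *key, int num_partitions)
{
    uint32_t h = 2166136261u;                  /* FNV-1a style hash */
    for (; *key != '\0'; key++)
        h = (h ^ (uint8_t)*key) * 16777619u;
    return (int)(h % (uint32_t)num_partitions);
}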

Outputs Page
The Outputs page allows you to specify details about how the Lookup File Set stage references a file set. The Lookup File Set stage can have only one output link which is a reference link. The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. Details about Lookup File Set stage properties are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Output Link Properties Tab


The Properties tab allows you to specify properties for the output link. These dictate how incoming data is read from the lookup table. There is only one output link property.
Category/Property              Values    Default  Mandatory?  Repeats?  Dependent of
Lookup Source/Lookup File Set  pathname  N/A      Y           N         N/A

Lookup Source Category


Lookup File Set This property defines the file set that the data will be referenced from. You can type in a pathname of, or browse for a file set descriptor file (by convention ending in .fs).

8
External Source Stage
The External Source stage is a file stage. It allows you to read data that is output from one or more source programs. The stage calls the program and passes appropriate arguments. The stage can have a single output link, and a single rejects link. It can be configured to execute in parallel or sequential mode. There is also an External Target stage which allows you to write to an external program (see Chapter 9). The External Source stage allows you to perform actions such as interfacing with databases not currently supported by DataStage Enterprise Edition.

When reading output from a program, DataStage needs to know something about its format. The information required is how the data is divided into rows and how rows are divided into columns. You specify this on the Format tab. Settings for individual columns can be overridden on the Columns tab using the Edit Column Metadata dialog box. When you edit an External Source stage, the External Source stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors."

The stage editor has two pages: Stage Page. This is always present and is used to specify general information about the stage. Outputs Page. This is where you specify details about the program or programs whose output data you are reading. There are one or two special points to note about using runtime column propagation (RCP) with External Source stages. See "Using RCP With External Source Stages" on page 8-18 for details.

Must Dos
DataStage has many defaults, which means that it can be very easy to include External Source stages in a job. This section specifies the minimum steps to take to get an External Source stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use the External Source stage: In the Output Link Properties Tab:

Specify whether the stage is providing details of the program (the default) or whether details will be provided in a file (using the latter method you can provide a list of files and arguments). If using the default source method, specify the name of the source program executable. You can also specify required arguments that DataStage will pass when it calls the program. Repeat this to specify multiple program calls. If using the program file source method, specify the name of the file containing the list of program names and arguments. Specify whether to maintain any partitions in the source data (False by default). Specify how to treat rejected rows (by default the stage continues and the rows are discarded).

In the Format Tab specify format details for the source data you are reading from, or accept the defaults (variable length columns enclosed in double quotes and delimited by commas, rows delimited with UNIX newlines). Ensure that column definitions have been specified (you can use a schema file for this if required).
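As a hedged illustration, a trivial source program that satisfies those default format settings could look like the following C program (the column values are invented); the stage calls the compiled executable and reads the rows it writes to standard output:

#include <stdio.h>

/* Hypothetical source program: writes rows in the stage's default
   format -- double-quoted, comma-delimited columns with a UNIX newline
   ending each row. */
int main(void)
{
    printf("\"1001\",\"Smith\",\"2004-12-01\"\n");
    printf("\"1002\",\"Jones\",\"2004-12-02\"\n");
    return 0;
}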

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data input from external programs is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode all the data from the source program is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. You can select Set or Clear. If you select Set, it will request that the next stage preserves the partitioning as is. Clear is the default. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

NLS Map Tab


The NLS Map tab allows you to define a character set map for the External Source stage. This overrides the default character set map set for the project or the job. You can specify that the map be supplied as a job parameter if required. You can also select Allow per-column mapping. This allows character set maps to be specified for individual columns within the data processed by the External Source stage. An extra property, NLS Map, appears in the Columns grid in
the Columns tab, but note that only ustring data types allow you to set an NLS map value (see "Data Types" on page 2-28).

Outputs Page
The Outputs page allows you to specify details about how the External Source stage reads data from an external program. The External Source stage can have only one output link. It can also have a single reject link, where rows that do not match the expected schema can be sent. The Output name drop-down list allows you to choose whether you are looking at details of the main output link (the stream link) or the reject link. The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Format tab gives information about the format of the files being read. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. Details about External Source stage properties and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Output Link Properties Tab


The Properties tab allows you to specify properties for the output link. These dictate how data is read from the external program or programs. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property             Values                               Default            Mandatory?                  Repeats?  Dependent of
Source/Source Program         string                               N/A                Y if Source Method =        Y         N/A
                                                                                      Specific Program(s)
Source/Source Programs File   pathname                             N/A                Y if Source Method =        Y         N/A
                                                                                      Program File(s)
Source/Source Method          Specific Program(s)/Program File(s)  Specific           Y                           N         N/A
                                                                   Program(s)
Options/Keep File Partitions  True/False                           False              Y                           N         N/A
Options/Reject Mode           Continue/Fail/Save                   Continue           N                           N         N/A
Options/Schema File           pathname                             N/A                N                           N         N/A
Options/Source Name Column    column name                          sourceNameColumn   N                           N         N/A

Source Category
Source Program Specifies the name of a program providing the source data. DataStage calls the specified program and passes to it any arguments specified. You can repeat this property to specify multiple program instances with different arguments. You can use a job parameter to supply program name and arguments.

Parallel Job Developers Guide

8-5

Outputs Page

External Source Stage

Source Programs File Specifies a file containing a list of program names and arguments. You can browse for the file or specify a job parameter. You can repeat this property to specify multiple files. Source Method This property specifies whether you are directly specifying a program (using the Source Program property) or using a file to specify a program (using the Source Programs File property).

Options Category
Keep File Partitions Set this to True to maintain the partitioning of the read data. Defaults to False. Reject Mode Allows you to specify behavior if a record fails to be read for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue. Schema File This is an optional property. By default the External Source stage will use the column definitions defined on the Columns tab and Schema tab as a schema for reading the file. You can, however, specify a file containing a schema instead (note, however, that if you have defined columns on the Columns tab, you should ensure these match the schema file). Type in a pathname or browse for a schema file. Source Name Column This is an optional property. It adds an extra column of type VarChar to the output of the stage, containing the pathname of the source the record is read from. You should also add this column manually to the Columns definitions to ensure that the column is not dropped if you are not using runtime column propagation, or it is turned off at some point.

Reject Link Properties


You cannot change the properties of a Reject link. The Properties tab for a reject link is blank.

Similarly, you cannot edit the column definitions for a reject link. The link will use a single column of type raw carrying the row which did not match the expected schema.

Format Tab
The Format tab allows you to supply information about the format of the source data that you are reading. The tab has a similar format to the Properties tab and is described on page 3-44. If you do not alter any of the Format settings, the External Source stage will expect to read a file of the following format: Data comprises variable length columns contained within double quotes. All columns are delimited by a comma, except for the final column in a row. Rows are delimited by a UNIX newline. You can use the Format As item from the shortcut menu in the Format Tab to quickly change to a fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file. Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it. Any property that you set on this tab can be overridden at the column level by setting properties for individual columns on the Edit Column Metadata dialog box (see page 3-26). This description uses the terms record and row and field and column interchangeably. The following sections list the Property types and properties available for each type. Record level These properties define details about how data records are formatted in the flat file. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are: Fill char. Does not apply to output links.

Final delimiter string. Specify the string written after the last column of a record in place of the column delimiter. Enter one or more characters; this precedes the record delimiter if one is used. Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to comma (see under "Field Defaults" for Delimiter) and Final delimiter string to ", " (a comma followed by a space; you do not need to enter the inverted commas) all fields are delimited by a comma, except the final field, which is delimited by a comma followed by an ASCII space character. DataStage skips the specified delimiter string when reading the file.

Final delimiter. Specify the single character written after the last column of a record in place of the field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. DataStage skips the specified delimiter string when reading the file. See the following diagram for an illustration.

whitespace. The last column of each record will not include any trailing white spaces found at the end of the record. end. The last column of each record does not include the field delimiter. This is the default setting. none. The last column of each record does not have a delimiter, used for fixed-width fields. null. The last column of each record is delimited by the ASCII null character. comma. The last column of each record is delimited by the ASCII comma character. tab. The last column of each record is delimited by the ASCII tab character.
[Figure: a record of comma-delimited fields ending in a newline record delimiter. With Final Delimiter = end, the last field is followed only by the record delimiter; with Final Delimiter = comma, the last field is followed by a comma and then the record delimiter.]

Intact. The intact property specifies an identifier of a partial schema. A partial schema specifies that only the column(s) named in the schema can be modified by the stage. All other columns in the row are passed through unmodified. (See "Partial Schemas" in Appendix A for details.) The file containing the partial schema is specified in the Schema File property on the Outputs tab. This property has a dependent property:

Check intact. Select this to force validation of the partial schema as the file or files are imported. Note that this can degrade performance.

Record delimiter string. Specify the string at the end of each record. Enter one or more characters. This is mutually exclusive with Record delimiter, which is the default, and record type and record prefix. Record delimiter. Specify the single character at the end of each record. Type a character or select one of the following:

UNIX Newline (the default) null

(To specify a DOS newline, use the Record delimiter string property set to \R\N or choose Format as DOS line terminator from the shortcut menu.) Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and record type.

Record length. Select Fixed where fixed length fields are being read. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes. This is not used by default (default files are comma-delimited).

Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by default. This is mutually exclusive with Record delimiter, which is the default, and record delimiter string and record type.

Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or VR. This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix and by default is not used.

Field Defaults

Defines default properties for columns read from the file or files. These are applied to all columns, but can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog box. Where you can enter a character, this can usually be an

ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are:

Actual field length. Specifies the actual number of bytes to skip if the field's length equals the setting of the null field length property.

Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select one of whitespace, end, none, null, comma, or tab. DataStage skips the delimiter when reading.

whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of the column. end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the same as a setting of None which is used for fields with fixed-width columns. none. No delimiter (used for fixed-width). null. ASCII Null character is used. comma. ASCII comma character is used. tab. ASCII tab character is used.

Delimiter string. Specify the string at the end of each field. Enter one or more characters. This is mutually exclusive with Delimiter, which is the default. For example, specifying ", " (a comma followed by a space; you do not need to enter the inverted commas) specifies that each field is delimited by ", " unless overridden for individual fields. DataStage skips the delimiter string when reading.

Null field length. The length in bytes of a variable-length field that contains a null. When a variable-length field is read, a length of null field length in the source field indicates that it contains a null. This property is mutually exclusive with null field value.

Null field value. Specifies the value given to a null field if the source is set to null. Can be a number, string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values. This property is mutually exclusive with Null field length and Actual length. For a fixed width data representation, you can use Pad char (from the general section of Type defaults) to specify a repeated trailing character if the value you specify is shorter than the fixed width of the field.

Prefix bytes. You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. DataStage reads the length prefix but does not include the prefix as a separate field in the data set it reads from the file. This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which are used by default. Print field. This property is intended for use when debugging jobs. Set it to have DataStage produce a message for every field it reads. The message has the format:
Importing N: D

where:

N is the field name. D is the imported data of the field. Non-printable characters contained in D are prefixed with an escape character and written as C string literals; if the field contains binary data, it is output in octal format.

Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another character or pair of characters. Choose Single or Double, or enter a character. This is set to double quotes by default. When reading, DataStage ignores the leading quote character and reads all bytes up to but not including the trailing quote character.

Vector prefix. For fields that are variable length vectors, specifies that a 1-, 2-, or 4-byte prefix contains the number of elements in the vector. You can override this default prefix for individual vectors. Variable-length vectors must use either a prefix on the vector or a link to another field in order to specify the number of elements in the vector. If the variable length vector has a prefix, you use this property to indicate the prefix length. DataStage reads the length prefix but does not include it as a separate field in the data set. By default, the prefix length is assumed to be one byte.

Type Defaults

These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General

These properties apply to several data types (unless overridden at column level):

Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

little-endian. The high byte is on the right. big-endian. The high byte is on the left. native-endian. As defined by the native format of the machine. This is the default.

Data Format. Specifies the data representation format of a field. Applies to fields of all data types except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that is neither string nor raw. Choose from:

binary text (the default)

A setting of binary has different meanings when applied to different data types:

For decimals, binary means packed. For other numerical data types, binary means not text. For dates, binary is equivalent to specifying the julian property for the date field. For time, binary is equivalent to midnight_seconds. For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.

By default data is formatted as text, as follows:

For the date data type, text specifies that the data read contains a text-based date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the decimal data type: a field represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. The destination string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored. For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): DataStage assumes that numeric fields are represented as text.

For the time data type: text specifies that the field represents time in the text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

(See page 2-28 for a description of data types.) Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type. Field width. The number of bytes in a field represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it's a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type. If you specify neither field width nor field max width, numeric fields written as text have the following number of bytes as their maximum width:

8-bit signed or unsigned integers: 4 bytes
16-bit signed or unsigned integers: 6 bytes
32-bit signed or unsigned integers: 11 bytes
64-bit signed or unsigned integers: 21 bytes
single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent)
double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)

Pad char. This property is ignored for output links. Character set. Specifies the character set. Choose from ASCII or EBCDIC. The default is ASCII. Applies to all data types except raw and ustring and record, subrec, or tagged containing no fields other than raw or ustring.

String

These properties are applied to columns with a string data type, unless overridden at column level.

Export EBCDIC as ASCII. Not relevant for output links.

Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters. For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see DataStage Developers Help.

Decimal

These properties are applied to columns with a decimal data type unless overridden at column level.

Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No. The default is No.

Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default).

Packed. Select an option to specify what the decimal columns contain; choose from:

Yes to specify that the decimal fields contain data in packed decimal format (the default). This has the following subproperties: Check. Select Yes to verify that data is packed, or No to not verify. Signed. Select Yes to use the existing sign when reading decimal fields. Select No to write a positive sign (0xf) regardless of the field's actual sign value.

No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the following sub-property: Sign Position. Choose leading or trailing as appropriate.

No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This has the following subproperty: Sign Position. Choose leading or trailing as appropriate.

No (overpunch) to specify that the field has a leading or end byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This has the following subproperty:


Sign Position. Choose leading or trailing as appropriate. Precision. Specifies the precision of a packed decimal. Enter a number. Rounding. Specifies how to round the source field to fit into the destination decimal when reading a source field to a decimal. Choose from:

up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE 754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1.
down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE 754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.
nearest value. Round the source column towards the nearest representable value. This mode corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4 becomes -1, -1.5 becomes -2.
truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1.
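The behavior of the four rounding options can be mimicked with the standard C math functions, as in the sketch below. This is only an illustration of the example values above; DataStage performs the equivalent rounding internally when it converts the field:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Sample values matching the examples in the property description */
    double samples[] = { 1.4, 1.5, 1.6, -1.4, -1.5, -1.6 };

    for (int i = 0; i < 6; i++) {
        double v = samples[i];
        /* up = ceil, down = floor, nearest value = round (half away from zero),
           truncate towards zero = trunc */
        printf("%5.1f  up:%3.0f  down:%3.0f  nearest:%3.0f  truncate:%3.0f\n",
               v, ceil(v), floor(v), round(v), trunc(v));
    }
    return 0;
}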

Scale. Specifies the scale of a source packed decimal. Numeric These properties apply to integer and float fields unless overridden at column level. C_format. Perform non-default conversion of data from string data to integer or floating-point data. This property specifies a C-language format string used for reading integer or floating point strings. This is passed to sscanf(). For example, specifying a C-format of %x and a field width of 8 ensures that a 32-bit integer is formatted as an 8-byte hexadecimal string. In_format. Format string used for conversion of data from string to integer or floating-point data. This is passed to sscanf(). By default, DataStage invokes the C sscanf() function to convert a numeric field formatted as a string to either integer or floating point data. If this function does not output data in a satisfactory format, you can specify the in_format property to pass formatting arguments to sscanf(). Out_format. This property is not relevant for output links.
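The %x example above corresponds roughly to the following use of sscanf(); the field value and variable names here are invented for illustration:

#include <stdio.h>

int main(void)
{
    const char field[] = "0000ff1a";   /* an 8-byte hexadecimal string read as text */
    unsigned int value = 0;

    /* A C_format of "%x" asks for a hexadecimal-to-integer conversion */
    if (sscanf(field, "%x", &value) == 1)
        printf("converted value = %u (0x%x)\n", value, value);
    return 0;
}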


Date These properties are applied to columns with a date data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text. Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide). Format string. The string format of a date. By default this is %yyyy-%mm-%dd. The Format string can contain one or a combination of the following elements:

%dd: A two-digit day. %mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff, for example %1970yy. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 - 366). %mmm: Three-character month abbreviation.

The format_string is subject to the following restrictions:


It cannot have more than one element of the same type, for example it cannot contain two %dd elements. It cannot have both %dd and %ddd. It cannot have both %yy and %yyyy. It cannot have both %mm and %ddd. It cannot have both %mmm and %ddd. It cannot have both %mm and %mmm. If it has %dd, it must have %mm or %mmm. It must have exactly one of %yy or %yyyy.

When you specify a date format string, prefix each component with the percent symbol (%). Separate the string's components with any character except the percent sign (%). If this format string does not include a day, it is set to the first of the month in the destination field. If the format string does not include the month and day, they default to January 1. Note that the format string must contain a month if it also contains a day; that is, you cannot omit only the month.


The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can also set this using the environment variable APT_DATE_CENTURY_BREAK_YEAR (see "APT_DATE_CENTURY_BREAK_YEAR" in Parallel Job Advanced Developers Guide), but this is overridden by %year_cutoffyy if you have set it. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029. Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT. Time These properties are applied to columns with a time data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text. Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss. The possible components of the time format string are:

%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%). Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.
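As a sketch of the midnight seconds representation (ordinary C with an arbitrary sample time, not DataStage code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int hh = 14, nn = 25, ss = 30;   /* 14:25:30 */

    /* Seconds elapsed since the previous midnight, stored as a 32-bit integer */
    int32_t midnight_seconds = (int32_t)(hh * 3600 + nn * 60 + ss);

    printf("%02d:%02d:%02d -> %ld\n", hh, nn, ss, (long)midnight_seconds);   /* 51930 */
    return 0;
}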


Timestamp These properties are applied to columns with a timestamp data type unless overridden at column level. Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss. Specify the format as follows: For the date:

%dd: A two-digit day. %mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 - 366).

For the time:


%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol (%). Separate the string's components with any character except the percent sign (%).

Using RCP With External Source Stages


Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. So such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.


External Source stages, unlike most other data sources, do not have inherent column definitions, and so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on External Source stages if you have used the Schema File property (see "Schema File" on page 8-6) to specify a schema which describes all the columns in the sequential files referenced by the stage. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:
Sequential File
File Set
External Source
External Target
Column Import
Column Export
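A schema file is a plain text file containing a parallel record schema. The following is a minimal sketch of what such a file might look like; the column names, types, and format properties are invented for illustration, and the full schema syntax is described in Appendix A:

record
  {final_delim=end, delim=',', quote=double}
(
  CustomerID:int32;
  CustomerName:string[max=30];
  OrderDate:date;
  Amount:decimal[8,2];
)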


9
External Target Stage
The External Target stage is a file stage. It allows you to write data to one or more external programs. The stage can have a single input link and a single rejects link. It can be configured to execute in parallel or sequential mode. There is also an External Source stage, which allows you to read from an external program (see Chapter 8). The External Target stage allows you to perform actions such as interfacing with databases not currently supported by the DataStage Parallel Extender.

When writing to a program, DataStage needs to know something about how to format the data. The information required is how the data is divided into rows and how rows are divided into columns. You specify this on the Format tab. Settings for individual columns can be overridden on the Columns tab using the Edit Column Metadata dialog box. When you edit an External Target stage, the External Target stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors." The stage editor has up to three pages:


Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify details about the program or programs you are writing data to. Outputs Page. This appears if the stage has a rejects link. There are one or two special points to note about using runtime column propagation (RCP) with External Target stages. See "Using RCP With External Target Stages" on page 9-21 for details.

Must Do's
DataStage has many defaults which means that it can be very easy to include External Target stages in a job. This section specifies the minimum steps to take to get an External Target stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end. This section describes the basic method; you will learn where the shortcuts are when you get familiar with the product. To use the External Target stage: In the Input Link Properties Tab:

Specify whether the stage is providing details of the program (the default) or whether details will be provided in a file (using the latter method you can provide a list of files and arguments). If using the default target method, specify the name of the target program executable. You can also specify required arguments that DataStage will pass when it calls the program. Repeat this to specify multiple program calls. If using the program file target method, specify the name of the file containing the list of program names and arguments. Specify whether to delete partially written data if the write fails for some reason (True by default). Specify how to treat rejected rows (by default the stage continues and the rows are discarded).

In the Format Tab specify format details for the data you are writing, or accept the defaults (variable length columns enclosed in double quotes and delimited by commas, rows delimited with UNIX newlines). Ensure that column definitions have been specified (this can be done in an earlier stage or in a schema file).


Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data output to external programs is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode all the data is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. You can select Set or Clear. If you select Set, it will request that the next stage preserves the partitioning as is. Clear is the default. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

NLS Map Tab


The NLS Map tab allows you to define a character set map for the External Target stage. This overrides the default character set map set for the project or the job. You can specify that the map be supplied as a job parameter if required. You can also select Allow per-column mapping. This allows character set maps to be specified for individual columns within the data processed by the External Target stage. An extra property, NLS Map, appears in the Columns grid in
the Columns tab, but note that only ustring data types allow you to set an NLS map value (see "Data Types" on page 2-28).

Inputs Page
The Inputs page allows you to specify details about how the External Target stage writes data to an external program. The External Target stage can have only one input link. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the external program. The Format tab gives information about the format of the data being written. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the input link. Details about External Target stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what program. Some of the properties are mandatory, although many have default

settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property                 Values                                Default               Mandatory?                                  Repeats?   Dependent of
Target/Destination Program       string                                N/A                   Y if Target Method = Specific Program(s)    Y          N/A
Target/Destination Programs File pathname                              N/A                   Y if Target Method = Program File(s)        Y          N/A
Target/Target Method             Specific Program(s)/Program File(s)   Specific Program(s)   Y                                           N          N/A
Options/Reject Mode              Continue/Fail/Save                    Continue              N                                           N          N/A
Options/Schema File              pathname                              N/A                   N                                           N          N/A

Target Category
Destination Program This is an optional property. Specifies the name of a program receiving data. DataStage calls the specified program and passes to it any arguments specified. You can repeat this property to specify multiple program instances with different arguments. You can use a job parameter to supply program name and arguments. Destination Programs File This is an optional property. Specifies a file containing a list of program names and arguments. You can browse for the file or specify a job parameter. You can repeat this property to specify multiple files. Target Method This property specifies whether you are directly specifying a program (using the Destination Program property) or using a file to specify a program (using the Destination Programs File property).
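A Destination Programs File contains a list of program names and arguments. A sketch of what such a file might contain is shown below, with one command line per line; all program names, paths, and arguments here are entirely hypothetical:

/usr/local/bin/load_warehouse -table orders -commit 1000
/usr/local/bin/load_warehouse -table customers -commit 1000
/home/dstage/scripts/push_to_legacy.sh --region EMEA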


Options Category
Reject Mode This is an optional property. Allows you to specify behavior if a record fails to be written for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue. Schema File This is an optional property. By default the External Target stage will use the column definitions defined on the Columns tab as a schema for writing the file. You can, however, specify a file containing a schema instead (note, however, that if you have defined columns on the Columns tab, you should ensure these match the schema file). Type in a pathname or browse for a schema file.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the target program. It also allows you to specify that the data should be sorted before being written. By default the stage will partition data in Auto mode. If the External Target stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the External Target stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the External Target stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning type drop-down list. This will override any current partitioning. If the External Target stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default Auto collection method. The following partitioning methods are available:


(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the External Target stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default method for the External Target stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the target program. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the


collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the Auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Format Tab
The Format tab allows you to supply information about the format of the data you are writing. The tab has a similar format to the Properties tab and is described on page 3-44. If you do not alter any of the Format settings, the External Target stage will produce a file of the following format: Data comprises variable length columns contained within double quotes. All columns are delimited by a comma, except for the final column in a row. Rows are delimited by a UNIX newline. You can use the Format As item from the shortcut menu in the Format Tab to quickly change to a fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file. To change individual properties, select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Pop up help for each of the available properties appears if you hover the mouse pointer over it.
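To make the default format concrete, records written with the unaltered Format settings might look like the following (the column values are invented): each field is a variable-length value enclosed in double quotes, fields are separated by commas with no delimiter after the last field, and each row ends with a UNIX newline.

"10001","Smith","2004-11-30","149.95"
"10002","Jones","2004-12-01","75.00"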


This description uses the terms record and row and field and column interchangeably. The following sections list the Property types and properties available for each type. Record level These properties define details about how data records are formatted in the flat file. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are: Fill char. Specify an ASCII character or a value in the range 0 to 255. You can also choose Space or Null from a drop-down list. This character is used to fill any gaps in a written record caused by column positioning properties. Set to 0 by default (which is the NULL character). For example, to set it to space you could also type in the space character or enter 32. Note that this value is restricted to one byte, so you cannot specify a multi-byte Unicode character. Final delimiter string. Specify a string to be written after the last column of a record in place of the column delimiter. Enter one or more characters, this precedes the record delimiter if one is used. Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to comma (see under "Field Defaults" for Delimiter) and Final delimiter string to , (comma space you do not need to enter the inverted commas) all fields are delimited by a comma, except the final field, which is delimited by a comma followed by an ASCII space character. Final delimiter. Specify a single character to be written after the last column of a record in place of the field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. See the following diagram for an illustration.

whitespace. The last column of each record will not include any trailing white spaces found at the end of the record. end. The last column of each record does not include the field delimiter. This is the default setting. none. The last column of each record does not have a delimiter; used for fixed-width fields. null. The last column of each record is delimited by the ASCII null character. comma. The last column of each record is delimited by the ASCII comma character.


tab. The last column of each record is delimited by the ASCII tab character.
[Diagram: record layout for the two Final delimiter settings. With Final Delimiter = end, each field except the last is followed by the field delimiter (a comma) and the record ends with the record delimiter (nl) immediately after the last field. With Final Delimiter = comma, a comma is also written after the last field, before the record delimiter.]

When writing, a space is now inserted after every field except the last in the record. Previously, a space was inserted after every field including the last. (If you want to revert to the pre-release 7.5 behavior of inserting a space after the last field, set the APT_FINAL_DELIM_COMPATIBLE environment variable.) Intact. The intact property specifies an identifier of a partial schema. A partial schema specifies that only the column(s) named in the schema can be modified by the stage. All other columns in the row are passed through unmodified. (See "Partial Schemas" in Appendix A for details.) The file containing the partial schema is specified in the Schema File property on the Properties tab (see page 5-9). This property has a dependent property, Check intact, but this is not relevant to input links. Record delimiter string. Specify a string to be written at the end of each record. Enter one or more characters. This is mutually exclusive with Record delimiter (which is the default), Record type, and Record prefix. Record delimiter. Specify a single character to be written at the end of each record. Type a character or select one of the following:

UNIX Newline (the default) null

(To implement a DOS newline, use the Record delimiter string property set to \R\N or choose Format as DOS line terminator from the shortcut menu.) Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record type. Record length. Select Fixed where fixed length fields are being written. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes. This is not used by default (default files are comma-delimited). The record is padded to the specified length with either zeros or the fill character if one has been specified.


Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by default. This is mutually exclusive with Record delimiter, which is the default, and record delimiter string and record type. Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or VR. This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix and by default is not used. Field Defaults Defines default properties for columns written to the file or files. These are applied to all columns written, but can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are: Actual field length. Specifies the number of bytes to fill with the Fill character when a field is identified as null. When DataStage identifies a null field, it will write a field of this length full of Fill characters. This is mutually exclusive with Null field value. Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select one of whitespace, end, none, null, comma, or tab.

whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of the column. end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the same as a setting of None which is used for fields with fixed-width columns. none. No delimiter (used for fixed-width). null. ASCII Null character is used. comma. ASCII comma character is used. tab. ASCII tab character is used.

Delimiter string. Specify a string to be written at the end of each field. Enter one or more characters. This is mutually exclusive with Delimiter, which is the default. For example, specifying , (comma


space you do not need to enter the inverted commas) would have each field delimited by , unless overridden for individual fields. Null field length. The length in bytes of a variable-length field that contains a null. When a variable-length field is written, DataStage writes a length value of null field length if the field contains a null. This property is mutually exclusive with null field value. Null field value. Specifies the value written to null field if the source is set to null. Can be a number, string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values. This property is mutually exclusive with Null field length and Actual length. For a fixed width data representation, you can use Pad char (from the general section of Type defaults) to specify a repeated trailing character if the value you specify is shorter than the fixed width of the field. Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column's length or the tag value for a tagged field. You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. DataStage inserts the prefix before each field. This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which are used by default. Print field. This property is not relevant for input links. Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another character or pair of characters. Choose Single or Double, or enter a character. This is set to double quotes by default. When writing, DataStage inserts the leading quote character, the data, and a trailing quote character. Quote characters are not counted as part of a field's length. Vector prefix. For fields that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector. You can override this default prefix for individual vectors. Variable-length vectors must use either a prefix on the vector or a link to another field in order to specify the number of elements in


the vector. If the variable length vector has a prefix, you use this property to indicate the prefix length. DataStage inserts the element count as a prefix of each variable-length vector field. By default, the prefix length is assumed to be one byte. Type Defaults These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type. General These properties apply to several data types (unless overridden at column level): Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

little-endian. The high byte is on the right. big-endian. The high byte is on the left. native-endian. As defined by the native format of the machine. This is the default.

Data Format. Specifies the data representation format of a field. Applies to fields of all data types except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that is neither string nor raw. Choose from:

binary text (the default)

A setting of binary has different meanings when applied to different data types:

For decimals, binary means packed. For other numerical data types, binary means not text. For dates, binary is equivalent to specifying the julian property for the date field. For time, binary is equivalent to midnight_seconds. For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.
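The Julian day used for binary dates counts days from noon GMT on January 1, 4713 BCE. A common integer formula for it (the Fliegel-Van Flandern algorithm) is sketched below in ordinary C, purely to illustrate what the stored value holds; DataStage performs this conversion itself:

#include <stdio.h>
#include <stdint.h>

/* Julian day number for a Gregorian calendar date (Fliegel-Van Flandern) */
static int32_t julian_day(int y, int m, int d)
{
    return (int32_t)((1461 * (y + 4800 + (m - 14) / 12)) / 4
                   + (367 * (m - 2 - 12 * ((m - 14) / 12))) / 12
                   - (3 * ((y + 4900 + (m - 14) / 12) / 100)) / 4
                   + d - 32075);
}

int main(void)
{
    printf("2000-01-01 -> Julian day %ld\n", (long)julian_day(2000, 1, 1));   /* 2451545 */
    return 0;
}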

By default data is formatted as text, as follows:


For the date data type, text specifies that the data to be written contains a text-based date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the decimal data type: a field represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. The destination string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored. For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): DataStage assumes that numeric fields are represented as text. For the time data type: text specifies that the field represents time in the text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

(See page 2-28 for a description of data types.) Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type. Field width. The number of bytes in a field represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it is a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type. If you specify neither field width nor field max width, numeric fields written as text have the following number of bytes as their maximum width:

8-bit signed or unsigned integers: 4 bytes
16-bit signed or unsigned integers: 6 bytes
32-bit signed or unsigned integers: 11 bytes


64-bit signed or unsigned integers: 21 bytes
single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent)
double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)

Pad char. Specifies the pad character used when strings or numeric values are written to an external string representation. Enter a character (single-byte for strings, can be multi-byte for ustrings) or choose null or space. The pad character is used when the external string representation is larger than required to hold the written field. In this case, the external string is filled with the pad character to its full length. Space is the default. Applies to string, ustring, and numeric data types and record, subrec, or tagged types if they contain at least one field of this type. Character set. Specifies the character set. Choose from ASCII or EBCDIC. The default is ASCII. Applies to all data types except raw and ustring and record, subrec, or tagged containing no fields other than raw or ustring. String These properties are applied to columns with a string data type, unless overridden at column level. Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters. Applies to fields of the string data type and record, subrec, or tagged fields if they contain at least one field of this type. Import ASCII as EBCDIC. Not relevant for input links. For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see DataStage Developers Help. Decimal These properties are applied to columns with a decimal data type unless overridden at column level. Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No. The default is No. Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default). Packed. Select an option to specify what the decimal columns contain, choose from:


Yes to specify that the decimal columns contain data in packed decimal format (the default). This has the following subproperties:

Check. Select Yes to verify that data is packed, or No to not verify. Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the column's actual sign value.

No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the following sub-property:

Sign Position. Choose leading or trailing as appropriate.

No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This has the following subproperty:

Sign Position. Choose leading or trailing as appropriate.

No (overpunch) to specify that the field has a leading or end byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This has the following subproperty:

Sign Position. Choose leading or trailing as appropriate. Precision. Specifies the precision where a decimal column is written in text format. Enter a number. When a decimal is written to a string representation, DataStage uses the precision and scale defined for the source decimal field to determine the length of the destination string. The precision and scale properties override this default. When they are defined, DataStage truncates or pads the source decimal to fit the size of the destination string. If you have also specified the field width property, DataStage truncates or pads the source decimal to fit the size specified by field width. Rounding. Specifies how to round a decimal column when writing it. Choose from:

up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE 754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1. down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE 754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.


nearest value. Round the source column towards the nearest representable value. This mode corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4 becomes -1, -1.5 becomes -2.
truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1.

Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination. By default, when DataStage writes a source decimal to a string representation, it uses the precision and scale defined for the source decimal field to determine the length of the destination string. You can override the default by means of the precision and scale properties. When you do, DataStage truncates or pads the source decimal to fit the size of the destination string. If you have also specified the field width property, DataStage truncates or pads the source decimal to fit the size specified by field width. Numeric These properties apply to integer and float fields unless overridden at column level. C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf(). For example, specifying a C-format of %x and a field width of 8 ensures that integers are written as 8-byte hexadecimal strings. In_format. This property is not relevant for input links. Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf(). By default, DataStage invokes the C sprintf() function to convert a numeric field formatted as either integer or floating point data to a string. If this function does not output data in a satisfactory format, you can specify the out_format property to pass formatting arguments to sprintf(). Date These properties are applied to columns with a date data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.


Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide). Format string. The string format of a date. By default this is %yyyy-%mm-%dd. The Format string can contain one or a combination of the following elements:

%dd: A two-digit day. %mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff, for example %1970yy. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 - 366). %mmm: Three-character month abbreviation.

The format_string is subject to the following restrictions:


It cannot have more than one element of the same type, for example it cannot contain two %dd elements. It cannot have both %dd and %ddd. It cannot have both %yy and %yyyy. It cannot have both %mm and %ddd. It cannot have both %mmm and %ddd. It cannot have both %mm and %mmm. If it has %dd, it must have %mm or %mmm. It must have exactly one of %yy or %yyyy.

When you specify a date format string, prefix each component with the percent symbol (%). Separate the string's components with any character except the percent sign (%). If this format string does not include a day, it is set to the first of the month in the destination field. If the format string does not include the month and day, they default to January 1. Note that the format string must contain a month if it also contains a day; that is, you cannot omit only the month. The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can also set this using the environment variable


APT_DATE_CENTURY_BREAK_YEAR (see "APT_DATE_CENTURY_BREAK_YEAR" in Parallel Job Advanced Developers Guide), but this is overridden by %year_cutoffyy if you have set it. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029. Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT. Time These properties are applied to columns with a time data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text. Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss. The possible components of the time format string are:

%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%). Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight. Timestamp These properties are applied to columns with a timestamp data type unless overridden at column level.


Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss. Specify the format as follows: For the date:

%dd: A two-digit day. %mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 - 366)

For the time:


%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent sign (%). Separate the string's components with any character except the percent sign (%).

Outputs Page
The Outputs page appears if the stage has a Reject link. The General tab allows you to specify an optional description of the output link. You cannot change the properties of a Reject link. The Properties tab for a reject link is blank. Similarly, you cannot edit the column definitions for a reject link. The link uses the column definitions for the link rejecting the data records.


Using RCP With External Target Stages


Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. So such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. External Target stages, unlike most other data targets, do not have inherent column definitions, and so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on External Target stages if you have used the Schema File property (see "Schema File" on page 9-6) to specify a schema which describes all the columns in the sequential files referenced by the stage. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:
Sequential File
File Set
External Source
External Target
Column Import
Column Export


10
Complex Flat File Stage
The Complex Flat File (CFF) stage is a file stage. You can use the stage to read a file or write a file, but you cannot use the same stage to do both. As a source, the stage can have multiple output links and a single reject link. As a target, the stage can have a single input link.
Note The interface for the CFF stage is different to that for standard parallel file stages - properties are defined in the Stage page File Options tab, format information is defined in the Stage page Record Options tab, and column information for both input and output tabs is described in the Stage page Columns tab.

When used as a source, the stage allows you to read data from one or more complex flat files, including MVS datasets with QSAM and VSAM files. A complex flat file may contain one or more GROUPs, REDEFINES, OCCURS, or OCCURS DEPENDING ON clauses. Complex Flat File source stages execute in parallel mode when they are used to read multiple files, but you can configure the stage to execute sequentially if it is only reading one file with a single reader. When used as a target, the stage allows you to write data to one or more complex flat files. It does not write to MVS datasets.


When you edit a CFF stage, the CFF stage editor appears. The stage editor has up to three pages, depending on whether you are reading or writing a file: Stage Page. This is always present and is used to specify general information about the stage, including details about the file or files being read from or written to. Input Page. This is present when you are writing to a complex flat file. It allows you to specify details about how data should be written to a target file, including partitioning and buffering information. Output Page. This is present when you are reading from a complex flat file. It allows you to select columns for output and change the default buffering settings on the output link if desired.

Must Do's
Ascential DataStage has many defaults which means that it can be very easy to include CFF stages in a job. This section specifies the minimum steps to take to get a CFF stage functioning. Ascential DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end. This section describes the basic method; you will learn where the shortcuts are when you get familiar with the product. To use the CFF stage: In the File Options Tab, specify the stage properties. If reading a file or files:


Specify the type of file you are reading. Give the name of the file or files you are going to read. Specify the record type of the files you are reading.

Define what action to take if files are missing from the source. Define what action to take with records that fail to match the expected meta data.

If writing a file or files:


Specify the type of file you are writing. Give the name of the files you are writing. Specify the record type of the files you are writing. Define what action to take if records fail to be written to the target file(s).

In the Record Options Tab, describe the format of the data you are reading or writing. In the Stage page Columns Tab, define the column definitions for the data you are reading or writing using this stage.

Stage Page
The General tab allows you to specify an optional description of the stage. The File Options tab allows you to specify the stage properties, while the Record Options tab allows you to describe the format of the files that you are reading or writing. The Columns page gives the column definitions for the input or output links, while the Layout page displays the meta data either as a parallel schema or a COBOL definition. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage. The Advanced tab allows you to specify further information about how the stage is executed.

File Options Tab


The File Options tab allows you to specify properties about how data is read from or written to files. The appearance of this tab differs depending on whether the stage is being used as a source or a target.

Source CFF Stage


Source stage file options include settings for the file type and name, record type, missing file action, reject mode, multiple node reading, reporting, and file partitioning.


The tab has the following fields: File Type. Specifies the type of source to import. This determines how your entry in the File name(s) field is interpreted. Select one of the following:

File(s). A single file or multiple files. This is the default. File pattern. A group of files. Source. One or more programs and arguments that provide source data to the import operator. Source list. A file containing the names of multiple programs that provide source data to the import operator.

MVS dataset. Select this box to specify that the source is an MVS dataset. This appears only if the project within which you are working is USS-enabled (i.e., parallel jobs are intended to run on a USS system - see Chapter 56, "Parallel Jobs on USS.") For MVS datasets, the file type must be File(s). Neither a filter nor multiple node reading is allowed. If you enclose the filename with single quotes in the File name(s) field, Ascential DataStage will add an escape character (\) before each quote.

File name(s). Type the names of the files to import, or click the arrow button to search for file names on the server. Your entry should correspond to your selection in the File type field using these guidelines:

For File(s), type either a single file name or multiple file names separated by commas, semicolons, or newlines.


For File pattern, type the name of the file that contains the list of files to be imported. You can also use a valid shell expression (in Bourne shell syntax) to generate a list of file names.
For Source, type one or more program names and their associated arguments, separated either by semicolons or newlines.
For Source list, type the name of a file containing multiple program names. The file must contain program command lines.
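For illustration, entries of this kind might look like the following (the directory, file names, and program shown here are hypothetical, and a Source program is assumed to write its data to standard output):

   File pattern:  /data/source/cust_*.dat
   Source:        gunzip -c /archive/cust_2004.dat.gz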

Job parameters can be used for one or more file names, or for a file pattern. To specify a job parameter, type the parameter name enclosed in #, such as #JobParameter#. You can specify multiple job parameters by separating the parameter names with commas, semicolons, or newlines. Click the arrow button to browse for an existing job parameter or to define a new one using the Job Properties dialog box.

Record type. Select the record type of the source data. The options are:

Fixed. All records are the same length. This is the default.
Fixed block. All records are the same length and are grouped in fixed-length blocks.
Variable. Records have variable lengths.
Variable block. Records have variable lengths and are grouped in variable-length blocks.
Variable spanned. Records have variable lengths and may span one or more control interval boundaries within a single control area.
Variable block spanned. Records have variable lengths and are grouped in variable-length blocks, where the blocks may span one or more control interval boundaries within a single control area.
VR.

If your source file contains OCCURS DEPENDING ON clauses, select Fixed as the record type for non-MVS data sources.

Missing file mode. Specifies the action to take if a file does not exist. Select one of the following:

Depends. Stops the job unless the file has a node name prefix of *:, in which case the file is skipped. This is the default.
Error. Stops the job.


OK. Skips the file.

Filter. Type a UNIX command to process input files as the data is read from each file, or click the arrow button to insert a job parameter. Filters do not apply to file patterns, source, or source list file types.

Multiple node reading. This area determines how files are read across multiple nodes. Select one option:

Read from multiple nodes. Select this box if you want the source file to be read in sections from multiple nodes. This is only allowed for a single file with a record type of fixed or fixed block.
Number of readers per node. Specify the number of instances of the import operator on each processing node. The default is one operator per node per input file. If you specify more than one reader, each instance of the file read operator reads a contiguous range of records from the input file. The starting record location in the file for each operator, or seek location, is determined by the data file size, the record length, and the number of instances of the operator as specified. The resulting data set contains one partition per instance of the file read operator, as determined by the number of readers specified. The data file(s) being read must contain fixed-length records.

These options are mutually exclusive with Read first n rows. If the MVS dataset box is selected, these fields are unavailable.

Report progress. Select this box to have the stage display a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file. The file type must be File(s) or File pattern.

Keep file partitions. Select this box to partition the imported data according to the organization of the input file(s). For example, if you are reading three files, you will have three partitions. This means that each file's contents stay in its own partition.

Read first n rows. Specifies the number of rows to read from each source file. The default value is 0, which means all rows are read. This option is mutually exclusive with Multiple node reading and does not apply to File pattern, Source or Source list file types.

Target CFF Stages


Target stage file options include settings for the file type and name, record type, write option, reject mode, filter, and cleanup on failure.

The tab has the following fields:

File type. Specifies the type of target file. This determines how your entry in the File name(s) field is interpreted. Select one of the following:

File(s). A single file or multiple files. This is the default.
Destination. One or more programs and arguments that read the exported data.
Destination list. A file containing the names of multiple programs that provide destinations for the exported data.

File name(s). Type the name of the file that data will be written to, or click the arrow button to search for file names on the server. This field is required, and the specified file must exist unless the write option is Create or Overwrite. Your entry should correspond to your selection in the File type field using these guidelines:

For File(s), type either a single file name or multiple file names separated by commas, semicolons, or newlines.
For Destination, type one or more program names and their associated arguments, separated either by semicolons or newlines.
For Destination list, type the name of a file containing multiple program names. The file must contain program command lines.

Writing to MVS datasets is not supported.


To specify a job parameter, type the parameter name enclosed in #, such as #JobParameter#. You can specify multiple job parameters by separating the parameter names with commas, semicolons, or newlines. Click the arrow button to browse for an existing job parameter or to define a new one using the Job Properties dialog box.

Record type. Select the record type of the output data. The options are:

Fixed. All records are the same length. This is the default.
Fixed block. All records are the same length and are grouped in fixed-length blocks.

Write option. Specifies how to write data to the target file(s). The same method applies to all files being written to. There are three options:

Append. Adds data to the existing file.
Create (Error if exists). Creates a new file without checking to see if one already exists. If Create is specified for a file that already exists, a runtime error will occur.
Overwrite. Deletes the existing file and replaces it with a new file. This is the default.

Reject mode. Specifies the action to take if any records are not written to the target file(s). Select one of the following:

Continue. Continues the operation and discards any rejected rows. This is the default.
Fail. Stops writing to the target if any rows are rejected.
Save. Sends rejected rows down a reject link. Select this option if a reject link exists.

Cleanup on failure. Select this box to delete any partially written files if the stage fails. If this box is not selected, any partially written files are left. The file type must be File(s).

Filter. Type a UNIX command to pass data through a filter program before it is written to the target file(s), or click the arrow button to insert a job parameter. Filters do not apply to Destination or Destination list file types.

Record Options Tab


The Record Options tab allows you to specify properties about the records in the source or target file. The appearance of this tab differs depending on whether the stage is being used as a source or a target.


Source stage record options include settings for the byte order, character set, data format, record delimiter, and decimals. There is also an option to print the fields to the log file during the import.

Target stage record options include the same settings, plus one for the pad character.

This tab has the following fields:

Float representation. Specifies that float fields are represented in IEEE format. This field is read-only.


Print fields. Appears only when the stage is used as a source. Select this check box to have the names and values of all fields in the schema printed to the log file during the import.

Byte order. Specifies how multiple-byte data types (integer, date, time, and timestamp) are ordered. Select from:

Little-endian. The high byte is on the right.
Big-endian. The high byte is on the left.
Native-endian. As defined by the native format of the machine. This is the default.

Does not apply to string or character data types.

Character set. Specifies the character data representation. Select ASCII or EBCDIC (the default).

Data format. Specifies the data representation format of a column. Select one of the following:

Binary. Field values are represented in binary format and decimals are represented in packed decimal format. This is the default.
Text. Fields are represented as text-based data and decimals are represented in string format.

Pad char. Appears only if the stage is being used as a target. Specifies the pad character used when character or numeric values are exported to an external string representation. Space is the default.

Record delimiter. Specifies a delimiter to indicate the end of a record. By default this is empty.

Rounding. Specifies how to round a decimal column when writing it. Select one of the following:

Up. Truncate source column towards positive infinity.
Down. Truncate source column towards negative infinity.
Nearest value. Round the source column towards the nearest representable value. This is the default.
Truncate towards zero. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.

Separator. Specifies the character that acts as the decimal separator. Select Project default to use the value specified at the project level, or select , (comma) or . (period).

Allow all zeros. Select this to specify that a packed decimal column containing all zeros (which is normally illegal) be treated as a valid representation of zero.

Columns Tab
Unlike other parallel job file stages, the CFF stage has a Columns tab on the Stage page. This is where you define the actual columns your stage uses. These columns are then projected to the Input page Columns tab, or the Output page Selection tab, depending on whether the stage is being used as a source or a target.
Note You can also define columns by dragging a table definition from the Repository window to the CFF stage icon on the Designer canvas. (This differs from other parallel stages, where you drag a table to a link.) You can then propagate source stage columns to one or more output links using the stage's shortcut menu.

The Columns tab allows you to define the COBOL file description for the data being read or written by the stage. This file description is then translated to column definitions. This tab contains a columns tree that displays the names of the stage columns, a columns grid with the detailed column definitions, and a properties tree that allows you to set properties for each column. Use the right mouse menu to display or hide these panels to suit your needs.

You can load, add, modify, or delete columns here. Click the Load button to load column definitions from a table in the DataStage Repository. You can also enter column definitions directly into the grid. If your column definitions describe array data, you are asked to specify how to handle array data within the stage (see "Complex File Load Options" on page 10-14).

Columns displayed here should reflect the actual layout of the file format. If you do not want to display all of the columns, you can specify that unwanted ones be replaced by filler columns. This is done in the Select Columns From Table dialog box when you load table definitions. Fillers can be expanded later if you need to reselect any columns. For more information about fillers, see "Filler Creation and Expansion" on page 10-14.

To edit column properties, select a property in the properties tree and use the Value field to make changes. Use the Available properties to add window to add optional attributes to the properties tree.


This tab contains the following components:

Columns tree. Displays the stage column names and record structure in a tree that can be collapsed or expanded using the right mouse menu. Selecting a column in the tree allows you to view or edit its properties in the columns grid or the properties tree. The tree contains four icon types: yellow folders represent group columns, blue folders represent group columns with arrays, single purple rectangles represent simple columns, and double purple rectangles represent columns with arrays.

Columns grid. Displays the column definitions for the stage. You can add, modify, or delete column definitions using the right mouse menu. When you select a column definition in the grid, it is highlighted in the columns tree, allowing you to view its location in the record structure. Its properties are also displayed in the properties tree, allowing you to set general and extended column attributes.

Properties tree. Displays the currently defined properties for each column. Properties are divided into three categories: General, Extended Attributes, and Derived Attributes. All of the mandatory properties are included in the tree by default and cannot be removed. Optional properties are displayed in the Available properties to add pane for each selected category. To add an optional property to the tree, click on it. You can remove it again by selecting it in the tree and clicking the arrow button. To edit properties, select a property in the tree and use the Value field to make changes. Properties that you must set a value for (i.e. which do not have a default value) are shown in the warning color (red by default), but change to black when you have set a value. You can change the warning color from the Options dialog box.

Value. Displays the value for the column property selected in the properties tree. You can change the value for general and extended attributes, but not for derived attributes. The method for entering a value changes according to the property you have selected. A description of the property appears in the box below this field.

Available properties to add. Displays optional properties for the selected category in the properties tree. Only properties which are not already defined for the column are shown. To add a property to the tree, click on it. You can remove it again by selecting it in the tree and clicking the arrow button.

Save As... . When you click the Save As... button, the Save table definition dialog box appears. This dialog box allows you to save a table definition into the DataStage Repository, where you can subsequently reuse it to provide column definitions for other stages. You can also save the table definition as a COBOL file definition (CFD) or DB2 DCLGen file (DFD) file from the same dialog box.

Clear All. Click this to clear all column definitions from the stage.

Load. Click this to selectively load columns from a table definition in the DataStage Repository:

First, the Table Definitions dialog box appears, allowing you to select an existing table or import a new one.
Next, the Select Columns From Table dialog box appears, allowing you to select the columns that you want to load. The Available columns tree displays COBOL structures such as groups and arrays. If you select a subset of columns, fillers can be generated to maintain the byte order of the columns. See "Filler Creation and Expansion" on page 10-14 for details.
If there are arrays in the column structure for which flattening is an option, the Complex file load option dialog box appears. See "Complex File Load Options" on page 10-14 for details.

If you load more than one table definition, the list of columns from the subsequent tables is added to the end of the current list. In cases where the first column of the subsequent list has a level number higher than the last column of the current list, Ascential DataStage inserts an 02 FILLER group item before the subsequent list is loaded. (This is not done, however, if the first column being loaded already has a level number of 02.)


Filler Creation and Expansion


Mainframe table definitions frequently contain hundreds of columns; therefore, to save storage space and processing time, there is a Create fillers option in the Select Columns From Table dialog box. This option, which is selected by default, is available only when you load columns from a simple or complex flat file. The sequences of unselected columns are collapsed into FILLER items with the appropriate size. The native data type is set to CHARACTER and the name set to FILLER_XX_YY, where XX is the start offset and YY is the end offset. Fillers for elements of a group array or an OCCURS DEPENDING ON (ODO) column have the name of FILLER_NN, where NN is the element number. The NN begins at 1 for the first unselected group element and continues sequentially. Any fillers that follow an ODO column will also be numbered sequentially. See Appendix C for examples of how fillers are created for different COBOL structures.

You can expand fillers in the Columns tree if you want to reselect any columns. Right-click on the filler in the left pane and select Expand Filler... from the shortcut menu. The Expand Filler dialog box appears, allowing you to select some or all of the columns from the given filler. There is no need to reload the table definition and reselect the columns.
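For example, using the naming convention described above, a run of unselected columns occupying bytes 11 through 40 of the record would be collapsed into a single item named FILLER_11_40, with a native data type of CHARACTER and a length of 30 bytes (the offsets in this illustration are hypothetical).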

Complex File Load Options


When you enter or load column definitions containing arrays in a CFF stage, the stage prompts you for information on how it should handle the array data in the stage. The Complex file load option dialog box appears. If you choose to pass an array as is, the columns with arrays are loaded as is. If you choose to flatten an array, all the elements of the array will appear as separate columns in the table definition. The data is presented as one row at execution time. Each array element is given a numeric suffix to make its name unique. For example, given the following complex flat file structure (in CFD format):
05 ID      PIC X(10)
05 NAME    PIC X(30)
05 CHILD   PIC X(30) OCCURS 5 TIMES


You will get the following column definitions:


05 ID        PIC X(10)
05 NAME      PIC X(30)
05 CHILD     PIC X(30)
05 CHILD_2   PIC X(30)
05 CHILD_3   PIC X(30)
05 CHILD_4   PIC X(30)
05 CHILD_5   PIC X(30)

A parallel array is flattened out in the same way. Array columns that have redefined fields or OCCURS DEPENDING ON clauses may not be flattened. Even if you choose to flatten all arrays in the Complex file load option dialog box, these columns are passed as is. The Complex file load option dialog box is as follows:

Options. Select an option to specify how array data will be treated in the stage:

Flatten selective arrays. Allows you to select arrays for flattening on an individual basis. This is the default option. Click on an array in the columns list and use the right mouse button to select Flatten. Columns that cannot be flattened are unavailable for selection.
Flatten all arrays. All arrays are flattened. Creates new columns for each element of the arrays.
As is. Passes arrays as is.

Description. Gives information about the load option you have chosen.


Columns. Displays the names of the column definitions and their structure. Array sizes are shown in parentheses. When using the Flatten selective arrays option, right-click on individual column definitions and choose Flatten as required. The array icon changes for the arrays that will be flattened.

Layout Tab
The Layout tab displays the schema format of the column definitions used in the stage. Select a button to view the data representation in one of two formats:

Parallel. Displays the OSH record schema.

COBOL. Displays the COBOL representation, including the column name, COBOL picture clause, starting and ending offsets, and column storage length.

You can use the shortcut menu to save the parallel view as a text file in *.osh format, or the COBOL view as an HTML file.
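As an illustration only, for a simple layout such as the ID/NAME columns used in the examples in this chapter, the parallel view would show a record schema along these lines (the exact field types displayed depend on your column definitions and on the data type mappings described below):

   record (
     ID: string[10];
     NAME: string[30];
   )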

In the parallel view, the mapping of COBOL native data types to parallel data types is displayed. If there are date masks on columns with CHARACTER native type, the column type is changed to DATE with the date mask translated to the appropriate parallel type. For date masks on columns with DECIMAL or INTEGER native type, the columns are translated to the parallel type using the CFF stage's underlying modify operator. For more information about the data type conversions that this operator performs, see "Changing Data Type" on page 28-3.

In the COBOL view, the storage lengths for a group are the sum of the storage lengths of the individual elements. If an element within the group redefines another element, the element storage length is not included in the group storage length. However, the storage length of the element itself is computed based on its picture clause.

NLS Map Tab


The NLS Map tab allows you to define a character set map for the CFF stage. This is applicable when the native data type for the stage columns is GRAPHIC_N, GRAPHIC_G, VARGRAPHIC_N, or VARGRAPHIC_G. The setting on this tab overrides the default character set map set for the project or the job. You can specify that the map be supplied as a job parameter if required.

Advanced Tab
This tab allows you to specify the following:

Execution mode. The execution mode is set automatically and cannot be changed. If the stage is only operating on one file (and there is one reader) the execution mode will be sequential. Otherwise it will be parallel.

Combinability mode. This is Auto by default, which allows Ascential DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.

Preserve partitioning. You can select Set or Clear. If you select Set, it will request that the next stage preserves the partitioning as is. Clear is the default. This only appears if the stage has an output link.

Node pool and resource constraints. This option is not applicable to CFF stages.

Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Input Page
The Input page allows you to specify details about how the CFF stage writes data to a file. The CFF stage can have only one input link. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being written. The Columns tab gives the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the input link. Details about CFF stage Columns tab and Partitioning tab are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the Advanced tab.

Input Link Columns Tab


The Columns tab displays the column definitions for the data coming into the stage, which will then be written out to a complex flat file. You cannot edit the column definitions on this tab, only view them. The columns are defined on the Stage page Columns tab (see "Columns Tab" on page 10-11). The tab contains a columns tree that displays the names of the stage columns, a columns grid with the detailed column definitions, and a properties tree that displays properties for each column. Use the right mouse menu to display or hide these panels to suit your needs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the target file. It also allows you to specify that the data should be sorted before being written.

By default the stage will partition data in Auto mode. If the CFF stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
Whether the CFF stage is set to execute in parallel or sequential mode.
Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the CFF stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning type drop-down list. This will override any current partitioning. If the CFF stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default Auto collection method.

The following partitioning methods are available:

(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the CFF stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns.
Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.


Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

(Auto). This is the default method for the CFF stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the target file. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the Auto methods). Select the check boxes as follows:

Perform sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

If NLS is enabled, an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.

You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.


Output Page
The Output page allows you to specify details about how the CFF stage reads data. The CFF stage can have multiple output links, and each link can read from multiple files. It can also have a single reject link. This is typically used when you are writing to a file and provides a location where records that have failed to be written to a file for some reason can be sent. When you are reading files, you can use a reject link as a destination for rows that do not match the expected column definitions. The Output name drop-down list allows you to choose whether you are looking at details of an output link (stream link) or the reject link. The General tab allows you to specify an optional description of the output link. The Selection tab allows you to select columns to output from the stage. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. Details about CFF stage properties and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Selection Tab
The Selection tab on the output link allows you to select columns to be output from the stage. The column definitions for the stage are given on the Stage page Columns tab (see "Columns Tab" on page 10-11). You can output all of these on an output link or choose a subset of them.


To select a column for output, copy it from the Available columns tree to the Selected columns list. Groups, elements, and arrays can be selected. Arrays can be kept as is or denormalized. For REDEFINES, you can select the original column, the redefined field, or both. Column icons have a checkmark in the Available columns tree after a column is selected.

Click >> to add all columns to the Selected columns list. By default group columns are not included, unless you first select the Enable all group column selection check box. If you select columns out of order, they will be reordered in the Selected columns list to match the structure of the input columns. When you highlight a selected column, the corresponding column is highlighted in the Available columns list.

To view the COBOL structure of the selected columns, click View Columns. To go back to the columns list, where you can modify your selections, click Edit Columns.

If no columns are selected on this tab, then all stage columns except group columns are automatically propagated to each empty output link when you click OK to exit the stage. The Selection tab is not available for the reject link.

Selecting Array Columns for Output


When you load columns into the CFF stage, you are given three options for handling source data containing arrays. You can pass the data as is, flatten all arrays on input to the stage, or flatten selected arrays on input. You choose one of these options from the Complex file load option dialog box, which appears when you load column definitions into the Stage Columns tab.

If you choose to flatten arrays, the flattening is done at the time the column meta data is loaded into the stage. All of the array elements appear as separate columns in the table. Each array column has a numeric suffix to make its name unique. You can select any or all of these columns for output.

If you choose to pass arrays as is, the array structure is preserved. The data is presented as a single row at execution time for each incoming row. If the array is normalized, the incoming single row is resolved into multiple output rows. Following are several cases for normalizing different types of array columns for output.

Selecting a Simple Normalized Array Column

A simple array is a single, one-dimensional array. This example shows the result when you select all columns as output columns. For each record that is read from the input file, five rows are written to the output link. The sixth row out the link causes the second record to be read from the file, starting the process over again.
Input Record:
05 ID      PIC X(10)
05 NAME    PIC X(30)
05 CHILD   PIC X(30) OCCURS 5 TIMES.

Output Rows:
Row 1: ID NAME CHILD(1)
Row 2: ID NAME CHILD(2)
Row 3: ID NAME CHILD(3)
Row 4: ID NAME CHILD(4)
Row 5: ID NAME CHILD(5)

Selecting a Nested Normalized Array Column

This example shows the result when you select a nested array column as output. If you select FIELD-A, FIELD-C and FIELD-D as output columns, Ascential DataStage multiplies the OCCURS values at each level. In this case, 6 rows are written to the output link.
Input Record:
05 FIELD-A        PIC X(4)
05 FIELD-B        OCCURS 2 TIMES.
   10 FIELD-C     PIC X(4)
   10 FIELD-D     PIC X(4) OCCURS 3 TIMES.

Output Rows:
Row 1: FIELD-A FIELD-C(1) FIELD-D(1,1)
Row 2: FIELD-A FIELD-C(1) FIELD-D(1,2)
Row 3: FIELD-A FIELD-C(1) FIELD-D(1,3)
Row 4: FIELD-A FIELD-C(2) FIELD-D(2,1)
Row 5: FIELD-A FIELD-C(2) FIELD-D(2,2)
Row 6: FIELD-A FIELD-C(2) FIELD-D(2,3)

Selecting Parallel Normalized Array Columns

Parallel arrays are array columns at the same level. The first example shows the result when you select all parallel array columns as output columns. Ascential DataStage determines the number of output rows using the largest subscript. As a result, the smallest array gets padded with default values and the element columns get repeated. In this case, if you select all of the input fields as output columns, four rows are written to the output link.
Input Record:
05 FIELD-A   PIC X(4)
05 FIELD-B   PIC X(4) OCCURS 2 TIMES.
05 FIELD-C   PIC X(4)
05 FIELD-D   PIC X(4) OCCURS 3 TIMES.
05 FIELD-E   PIC X(4) OCCURS 4 TIMES.

Output Rows:
Row 1: FIELD-A FIELD-B(1) FIELD-C FIELD-D(1) FIELD-E(1)
Row 2: FIELD-A FIELD-B(2) FIELD-C FIELD-D(2) FIELD-E(2)
Row 3: FIELD-A            FIELD-C FIELD-D(3) FIELD-E(3)
Row 4: FIELD-A            FIELD-C            FIELD-E(4)


In the next example, only a subset of the parallel array columns are selected (FIELD-B and FIELD-E). FIELD-D is passed as is. The number of output rows is determined by the maximum size of the denormalized columns. In this case, four rows are written to the output link.
Output Rows:
Row 1: FIELD-A FIELD-B(1) FIELD-C FIELD-D(1) FIELD-D(2) FIELD-D(3) FIELD-E(1)
Row 2: FIELD-A FIELD-B(2) FIELD-C FIELD-D(1) FIELD-D(2) FIELD-D(3) FIELD-E(2)
Row 3: FIELD-A            FIELD-C FIELD-D(1) FIELD-D(2) FIELD-D(3) FIELD-E(3)
Row 4: FIELD-A            FIELD-C FIELD-D(1) FIELD-D(2) FIELD-D(3) FIELD-E(4)

Selecting Nested Parallel Denormalized Array Columns

This complex scenario shows the result when you select both parallel array fields and nested array fields as output. If you select FIELD-A, FIELD-C, and FIELD-E as output columns in this example, Ascential DataStage determines the number of output rows by using the largest OCCURS value at each level and multiplying them. In this case, three is the largest OCCURS value at the outer (05) level, and five is the largest OCCURS value at the inner (10) level. Therefore, 15 rows are written to the output link. Notice that some of the subscripts repeat. In particular, those that are smaller than the largest OCCURS value at each level start over, including the second subscript of FIELD-C and the first subscript of FIELD-E.
Input Record:
05 FIELD-A        PIC X(10)
05 FIELD-B        OCCURS 3 TIMES.
   10 FIELD-C     PIC X(2) OCCURS 4 TIMES.
05 FIELD-D        OCCURS 2 TIMES.
   10 FIELD-E     PIC 9(3) OCCURS 5 TIMES.


Output Rows:
Row 1:  FIELD-A  FIELD-C(1,1)  FIELD-E(1,1)
Row 2:  FIELD-A  FIELD-C(1,2)  FIELD-E(1,2)
Row 3:  FIELD-A  FIELD-C(1,3)  FIELD-E(1,3)
Row 4:  FIELD-A  FIELD-C(1,4)  FIELD-E(1,4)
Row 5:  FIELD-A  FIELD-C(1,1)  FIELD-E(1,5)
Row 6:  FIELD-A  FIELD-C(2,1)  FIELD-E(2,1)
Row 7:  FIELD-A  FIELD-C(2,2)  FIELD-E(2,2)
Row 8:  FIELD-A  FIELD-C(2,3)  FIELD-E(2,3)
Row 9:  FIELD-A  FIELD-C(2,4)  FIELD-E(2,4)
Row 10: FIELD-A  FIELD-C(2,1)  FIELD-E(2,5)
Row 11: FIELD-A  FIELD-C(3,1)  FIELD-E(1,1)
Row 12: FIELD-A  FIELD-C(3,2)  FIELD-E(1,2)
Row 13: FIELD-A  FIELD-C(3,3)  FIELD-E(1,3)
Row 14: FIELD-A  FIELD-C(3,4)  FIELD-E(1,4)
Row 15: FIELD-A  FIELD-C(3,1)  FIELD-E(1,5)

Selecting Group Columns for Output


Group columns contain elements or subgroups. When you select groups or their elements for output, they are handled in the following manner:

If a group column is selected with any of its elements, the group column and the selected element columns are passed as group and element columns.
If only elements of the group are selected and not the group column itself, the selected element columns are treated as individual columns. Even if the selected element columns are within multiple or nested groups, all element columns are treated as top-level columns in the selection list on the Selection tab.
A group column may not be selected without any of its elements.

Output Link Columns Tab


The Columns tab displays the column definitions for the data to be output on the link. You cannot edit the column definitions on this tab, only view them.

The tab contains a columns tree that displays the names of the stage columns, a columns grid with the detailed column definitions, and a properties tree that displays properties for each column. Use the right mouse menu to display or hide these panels to suit your needs. The columns are defined on the Stage page Columns tab (see "Columns Tab" on page 10-11).

Reject Links
You cannot change the selection properties of a reject link. The Selection tab for a reject link is blank. Similarly, you cannot edit the column definitions for a reject link. For writing files, the link uses the column definitions for the input link. For reading files, the link uses a single column called rejected containing raw data for columns rejected after reading because they do not match the schema.


11
SAS Parallel Data Set Stage
The SAS Parallel Data Set stage is a file stage. It allows you to read data from or write data to a parallel SAS data set in conjunction with an SAS stage (described in Chapter 38). The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode. (More information about using Enterprise Edition with SAS is given in SAS Stage Supplementary Guide.) DataStage uses an SAS parallel data set to store data being operated on by an SAS stage in a persistent form. An SAS parallel data set is a set of one or more sequential SAS data sets, with a header file specifying the names and locations of all the component files. By convention, the header file has the suffix .psds.

The stage editor has up to three pages, depending on whether you are reading or writing a data set:

Stage Page. This is always present and is used to specify general information about the stage.

Inputs Page. This is present when you are writing to a data set. This is where you specify details about the data set being written to.

Outputs Page. This is present when you are reading from a data set. This is where you specify details about the data set being read from.

Must Dos
DataStage has many defaults which means that it can be very easy to include SAS Data Set stages in a job. This section specifies the minimum steps to take to get a SAS Data Set stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. The steps required depend on whether you are reading or writing SAS data sets.

Writing an SAS Data Set


In the Input Link Properties Tab:

Specify the name of the SAS data set you are writing to.
Specify what happens if a data set with that name already exists (by default this causes an error).

Ensure that column definitions have been specified for the data set (this can be done in an earlier stage).

Reading an SAS Data Set


In the Output Link Properties Tab:

Specify the name of the SAS data set you are reading.

Ensure that column definitions have been specified for the data set (this can be done in an earlier stage).

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes.


Advanced Tab
This tab allows you to specify the following:

Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.

Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about how the SAS Data Set stage writes data to a data set. The SAS Data Set stage can have only one input link. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the data set. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the input link.


Details about SAS Data Set stage properties are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what data set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows:
Category/Property      Values                                       Default                    Mandatory?  Repeats?  Dependent of
Target/File            pathname                                     N/A                        Y           N         N/A
Target/Update Policy   Append/Create (Error if exists)/Overwrite    Create (Error if exists)   Y           N         N/A

Options Category
File. The name of the control file for the data set. You can browse for the file or enter a job parameter. By convention the file has the suffix .psds.

Update Policy. Specifies what action will be taken if the data set you are writing to already exists. Choose from:
Append. Append to the existing data set.
Create (Error if exists). DataStage reports an error if the data set already exists.
Overwrite. Overwrite any existing data set.
The default is Create (Error if exists).


Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the data set. It also allows you to specify that the data should be sorted before being written.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the SAS Data Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
Whether the SAS Data Set stage is set to execute in parallel or sequential mode.
Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the SAS Data Set stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the SAS Data Set stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Parallel SAS Data Set stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
Random. The records are partitioned randomly, based on the output of a random number generator.


Round Robin. The records are partitioned on a round robin basis as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

(Auto). This is the default collection method for Parallel SAS Data Set stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the data set. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows:

Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.


If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about how the Parallel SAS Data Set stage reads data from a data set. The Parallel SAS Data Set stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the output link. Details about Data Set stage properties and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Output Link Properties Tab


The Properties tab allows you to specify properties for the output link. These dictate how incoming data is read from the data set. The SAS Data Set stage only has a single property.
Category/Property   Values     Default   Mandatory?  Repeats?  Dependent of
Source/File         pathname   N/A       Y           N         N/A

Source Category
File. The name of the control file for the parallel SAS data set. You can browse for the file or enter a job parameter. The file has the suffix .psds.


12
DB2/UDB Enterprise Stage
The DB2/UDB Enterprise stage is a database stage. It allows you to read data from and write data to a DB2 database. It can also be used in conjunction with a Lookup stage to access a lookup table hosted by a DB2 database (see Chapter 20, "Merge Stage.") DB2 databases distribute data in multiple partitions. DataStage can match the partitioning as it reads or writes data from/to a DB2 database.

The DB2/UDB Enterprise stage can have a single input link and a single output reject link, or a single output link or output reference link. The stage performs one of the following operations:

Writes to a DB2 table (using INSERT).
Updates a DB2 table (using INSERT and/or UPDATE as appropriate). Uses the DB2 CLI to enhance performance.
Loads a DB2 table (using DB2 fast loader).
Reads a DB2 table.
Deletes rows from a DB2 table.
Performs a lookup directly on a DB2 table.
Loads a DB2 table into memory and then performs a lookup on it.

When using a DB2/UDB Enterprise stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup (you could, for example, use a Modify stage to do this). See Chapter 20, "Merge Stage," for more details about performing lookups.

When you edit a DB2/UDB Enterprise stage, the stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors." The stage editor has up to three pages, depending on whether you are reading or writing a database:

Stage Page. This is always present and is used to specify general information about the stage.

Inputs Page. This is present when you are writing to a DB2 database. This is where you specify details about the data being written.

Outputs Page. This is present when you are reading from a DB2 database, or performing a lookup on a DB2 database. This is where you specify details about the data being read.


Accessing DB2 Databases


Before using DB2/UDB Enterprise stages for the first time, you should carry out the configuration procedure described in "Configuring for Enterprise Edition" in the DataStage Install and Upgrade Guide.

To use DB2/UDB Enterprise stages you must have valid accounts and appropriate privileges on the databases to which they connect. If using DB2 8.1 ESE (Enterprise Server edition), DPF (database partitioning feature) must be installed along with DB2 8.1 ESE, in order to take advantage of DataStage's parallel capabilities. DB2 8.1 ESE with DPF is equivalent to 7.2 EEE.

The required DB2 privileges are as follows:

SELECT on any tables to be read.
INSERT on any existing tables to be updated.
TABLE CREATE to create any new tables.
INSERT and TABLE CREATE on any existing tables to be replaced.
DBADM on any database written by LOAD method.

You can grant this privilege in several ways in DB2. One is to start DB2, connect to a database, and grant DBADM privilege to a user, as shown below:
db2> CONNECT TO db_name
db2> GRANT DBADM ON DATABASE TO USER user_name

where db_name is the name of the DB2 database and user_name is the login name of the DataStage user. If you specify the message file property, the database instance must have read/write privilege on that file.

The user's PATH should include $DB2_HOME/bin (e.g., /opt/IBMdb2/V7.1/bin). The LIBPATH should include $DB2_HOME/lib before any other lib statements (e.g., /opt/IBMdb2/V7.1/lib).

The following DB2 environment variables set the run-time characteristics of your system:

DB2INSTANCE specifies the user name of the owner of the DB2 instance. DB2 uses DB2INSTANCE to determine the location of db2nodes.cfg. For example, if you set DB2INSTANCE to "Mary", the location of db2nodes.cfg is ~Mary/sqllib/db2nodes.cfg.

DB2DBDFT specifies the name of the DB2 database that you want to access from your DB2/UDB Enterprise Stage. There are two other methods of specifying the DB2 database:


1. The override database property of the DB2/UDB Enterprise Stage Inputs or Outputs link.
2. The APT_DBNAME environment variable (this takes precedence over DB2DBDFT).

You should normally use the input property Row Commit Interval to specify the number of records to insert into a table between commits (see page 12-25). Previously the environment variable APT_RDBMS_COMMIT_ROWS was used for this, and it is still available for backwards compatibility. You can set this environment variable to any value between 1 and (2^31 - 1) to specify the number of records. The default value is 2000. If you set APT_RDBMS_COMMIT_ROWS to 0, a negative number, or an invalid value, a warning is issued and each partition commits only once, after the last insertion. If you set APT_RDBMS_COMMIT_ROWS to a small value, you force DB2 to perform frequent commits. Therefore, if your program terminates unexpectedly, your data set can still contain partial results that you can use. However, you may pay a performance penalty because of the high frequency of the commits. If you set a large value for APT_RDBMS_COMMIT_ROWS, DB2 must log a correspondingly large amount of rollback information. This, too, may slow your application. If you set neither the Row Commit Interval property nor the APT_RDBMS_COMMIT_ROWS environment variable, the commit interval defaults to 2000.
Note If you are using DB2 7.2, you must ensure that the directory holding the configuration file (as specified by APT_CONFIG_FILE) has the permissions 777.

Remote Connection
You can also connect from a DB2/UDB Enterprise stage to a remote DB2 Server. The connection is made via a DB2 client. In order to remotely connect from a DB2 client to a DB2 server, the DB2 client should be located on the same machine as the DataStage server. Both DB2 client and DB2 server need to be configured for remote connection communication (see your DB2 Database Administrator). The DataStage configuration file needs to contain the node on which DataStage and the DB2 client are installed and the nodes of the remote computer where the DB2 server is installed (see "The Parallel Engine Configuration File").


On the DB2/UDB Enterprise stage in your parallel job, you need to set the following properties:
- Client Instance Name. Set this to the DB2 client instance name. If you set this property, DataStage assumes you require remote connection.
- Server. Optionally set this to the instance name of the DB2 server. Otherwise use the DB2 environment variable, DB2INSTANCE, to identify the instance name of the DB2 server.
- Client Alias DB Name. Set this to the DB2 client's alias database name for the remote DB2 server database. This is required only if the client's alias is different from the actual name of the remote server database.
- Database. Optionally set this to the remote server database name. Otherwise use the environment variables APT_DBNAME or APT_DB2DBDFT to identify the database.
- User. Enter the user name for connecting to DB2. This is required for a remote connection.
- Password. Enter the password for connecting to DB2. This is required for a remote connection.
You can use DataStage's remote connection facilities to connect to different DB2 servers within the same job. You could, for example, read from a DB2 database on one server, use this data to access a lookup table on another DB2 server, then write any rejected rows to a third DB2 server. Each database would be accessed by a different stage in the job with the Client Instance Name and Server properties set appropriately.

Handling Special Characters (# and $)


The characters # and $ are reserved in DataStage and special steps are needed to handle DB2 databases which use the characters # and $ in column names. DataStage converts these characters into an internal format, then converts them back as necessary. To take advantage of this facility, you need to do the following:


- In DataStage Administrator, open the Environment Variables dialog for the project in question, and set the environment variable DS_ENABLE_RESERVED_CHAR_CONVERT to true (this can be found in the General\Customize branch).
- Avoid using the strings __035__ and __036__ in your DB2 column names (these are used as the internal representations of # and $ respectively).
When using this feature in your job, you should import meta data using the Plug-in Meta Data Import tool, and avoid hand-editing (this minimizes the risk of mistakes or confusion). Once the table definition is loaded, the internal column names are displayed rather than the original DB2 names, both in table definitions and in the Data Browser. They are also used in derivations and expressions. The original names are used in generated SQL statements, however, and you should use them if entering SQL in the job yourself. Generally, in the DB2 stage, you enter external names everywhere except when referring to stage column names, where you use names in the form ORCHESTRATE.internal_name.
When using the DB2 stage as a target, you should enter external names as follows:
- For Write and Load options, use external names for select list properties.
- For Upsert option, for update and insert, use external names when referring to DB2 table column names, and internal names when referring to the stage column names. For example:
INSERT INTO tablename ($A#, ##B$)
VALUES (ORCHESTRATE.__036__A__035__, ORCHESTRATE.__035____035__B__036__)

UPDATE tablename SET ##B$ = ORCHESTRATE.__035____035__B__036__
WHERE ($A# = ORCHESTRATE.__036__A__035__)


When using the DB2 stage as a source, you should enter external names as follows:
- For Read using the user-defined SQL method, use external names for DB2 columns for SELECT. For example:
SELECT #M$, #D$ FROM tablename WHERE (#M$ > 5)
- For Read using the Table method, use external names in select list and where properties.
When using the DB2 stage in parallel jobs as a look-up, you should enter external or internal names as follows:
- For Lookups using the user-defined SQL method, use external names for DB2 columns for SELECT, and for DB2 columns in any WHERE clause you might add. Use internal names when referring to the stage column names in the WHERE clause. For example:
SELECT #M$, #D$ FROM tablename WHERE (#B$ = ORCHESTRATE.__035__B__036__)

- For Lookups using the Table method, use external names in select list and where properties. Use internal names for the key option on the Inputs page Properties tab of the Lookup stage to which the DB2 stage is attached.

Using the Pad Character Property


Use the Pad Character property when using upsert or performing a lookup to pad string and ustring fields that are less than the length of the DB2 CHAR column. Use this property for string and ustring fields that are inserted in DB2 or are used in the WHERE clause of an UPDATE, DELETE, or SELECT statement when all three of these conditions are met:
1. The UPDATE or SELECT statement contains string or ustring fields that map to CHAR columns in the WHERE clause.
2. The length of the string or ustring field is less than the length of the CHAR column.
3. The padding character for the CHAR columns is not the null terminator.

For example, if you add rows to a table using an INSERT statement in SQL, DB2 automatically pads CHAR fields with spaces. When you subsequently use the DB2/UDB Enterprise stage to update or query the table, you must use the Pad Character property with the value of a space in order to produce the correct results.
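To make this concrete, here is a minimal sketch. The table CUSTOMERS and its columns ACCT_TYPE (a CHAR(8) column) and BALANCE are hypothetical names invented for illustration, and ORCHESTRATE.acctType refers to the corresponding stage column, as elsewhere in this chapter:

-- DB2 pads the inserted CHAR value with spaces, storing 'gold    '
INSERT INTO CUSTOMERS (ACCT_TYPE, BALANCE) VALUES ('gold', 0)
-- When the stage later binds a shorter string field into a WHERE clause such as
-- the one below, setting Pad Character to a space pads the bound value to
-- 'gold    ' so that it matches the stored, space-padded CHAR value.
UPDATE CUSTOMERS SET BALANCE = 100 WHERE (ACCT_TYPE = ORCHESTRATE.acctType)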


When you both insert rows and subsequently update or query them using the DB2/UDB Enterprise stage, you do not need to specify the Pad Character property. The stage automatically pads with null terminators, and the default pad character for the stage is the null terminator.

Type Conversions - Writing to DB2/UDB


When writing or loading, the DB2/UDB Enterprise stage automatically converts DataStage data types to DB2/UDB data types as shown in the following table:
DataStage SQL Data Type | Underlying Data Type | DB2/UDB Data Type
Date | date | DATE
Time | time | TIME
Timestamp | timestamp | TIMESTAMP
Decimal, Numeric | decimal (p, s) | DECIMAL (p, s)
TinyInt | int8 | SMALLINT
SmallInt | int16 | SMALLINT
Integer | int32 | INTEGER
Float, Real | sfloat | FLOAT
Double | dfloat | FLOAT
Unknown, Char, LongVarChar, VarChar | fixed-length string in the form string[n] and ustring[n]; length <= 254 bytes | CHAR(n) where n is the string length
LongVarChar, VarChar | fixed-length string in the form string[n] and ustring[n]; 255 <= length <= 4000 bytes | VARCHAR(n) where n is the string length
LongVarChar, VarChar | variable-length string, in the form string[max=n] and ustring[max=n]; maximum length <= 4000 bytes | VARCHAR(n) where n is the maximum string length
LongVarChar, VarChar | variable-length string in the form string and ustring | VARCHAR(32)*
LongVarChar, VarChar | string and ustring, 4000 bytes < length | Not supported


* The default length of VARCHAR is 32 bytes. That is, 32 bytes are allocated for each variable-length string field in the input data set. If an input variable-length string field is longer than 32 bytes, the stage issues a warning.

Type Conversions - Reading from DB2/UDB


When reading, the DB2/UDB Enterprise stage automatically converts DB2/UDB data types to DataStage data types as shown in the following table:
DataStage SQL Data Type | Underlying Data Type | DB2/UDB Data Type
Time or Timestamp | time or timestamp with corresponding fractional precision for time. If the DATETIME starts with a year component, the result is a timestamp field. If the DATETIME starts with an hour, the result is a time field. | DATETIME
Decimal, Numeric | decimal (p, s) where p is the precision and s is the scale. The maximum precision is 32, and a decimal with floating scale is converted to a dfloat. | DECIMAL (p, s)
TinyInt | int8 | SMALLINT
SmallInt | int16 | SMALLINT
Integer | int32 | INTEGER
Double | dfloat | FLOAT
Float, Real | sfloat | SMALLFLOAT
Float, Real | sfloat | REAL
Double | dfloat | DOUBLE-PRECISION
Decimal | decimal | MONEY
Unknown, Char, LongVarChar, VarChar, NChar, NVarChar, LongNVarChar | string[n] or ustring[n] | NCHAR(n, r)
Unknown, Char, LongVarChar, VarChar, NChar, NVarChar, LongNVarChar | string[max = n] or ustring[max = n] | NVARCHAR(n, r)
Unknown, Char, LongVarChar, VarChar, NChar, NVarChar, LongNVarChar | string[max = n] or ustring[max = n] | VARCHAR(n)

Examples
Looking Up a DB2/UDB Table
This example shows what happens when data is looked up in a DB2/UDB table. The stage in this case will look up the interest rate for each customer based on the account type. Here is the data that arrives on the primary link:
Customer | accountNo | accountType | balance
Latimer | 7125678 | plat | 7890.76
Ridley | 7238892 | flexi | 234.88
Cranmer | 7611236 | gold | 1288.00
Hooper | 7176672 | flexi | 3456.99
Moore | 7146789 | gold | 424.76


Here is the data in the DB2/UDB lookup table:


accountType | InterestRate
bronze | 1.25
silver | 1.50
gold | 1.75
plat | 2.00
flexi | 1.88
fixterm | 3.00

Here is what the lookup stage will output:


Customer | accountNo | accountType | balance | InterestRate
Latimer | 7125678 | plat | 7890.76 | 2.00
Ridley | 7238892 | flexi | 234.88 | 1.88
Cranmer | 7611236 | gold | 1288.00 | 1.75
Hooper | 7176672 | flexi | 3456.99 | 1.88
Moore | 7146789 | gold | 424.76 | 1.75

The job looks like the one illustrated on page 12-2. The Data_set stage provides the primary input, DB2_lookup_table provides the lookup data, Lookup_1 performs the lookup and outputs the resulting data to Data_Set_3. In the DB2/UDB stage we specify that we are going to look up the data directly in the DB2/UDB database, and the name of the table we are going to look up. In the Lookup stage we specify the column that we are using as the key for the lookup.
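For a direct (sparse) lookup such as this, the query run against the DB2/UDB table is conceptually similar to the sketch below. The table name interest_rates is hypothetical, and ORCHESTRATE.accountType refers to the key column arriving on the primary link:

SELECT accountType, InterestRate
FROM interest_rates
WHERE (accountType = ORCHESTRATE.accountType)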


The properties for the DB2/UDB stage are as follows:

The properties for the Lookup stage are as follows:

Updating a DB2/UDB Table


This example shows a DB2/UDB table being updated with three new columns. The database records the horse health records of a large stud. Details of the worming records are being added to the main table and populated with the most recent data, using the existing name column as a key. The meta data for the new columns is as follows:

We are going to specify Upsert as the write method and choose User-defined Update & Insert as the upsert mode; this is so that we do not include the existing name column in the INSERT statement. The properties (showing the INSERT statement) are shown below. The INSERT statement is as generated by DataStage, except that the name column is removed.
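As a rough sketch only (the table name horse_health and the column names wormDate, wormProduct, and wormNextDue are hypothetical, invented for illustration), the pair of statements might look like this, with the name column omitted from the INSERT but used as the key in the UPDATE:

INSERT INTO horse_health (wormDate, wormProduct, wormNextDue)
VALUES (ORCHESTRATE.wormDate, ORCHESTRATE.wormProduct, ORCHESTRATE.wormNextDue)

UPDATE horse_health
SET wormDate = ORCHESTRATE.wormDate, wormProduct = ORCHESTRATE.wormProduct, wormNextDue = ORCHESTRATE.wormNextDue
WHERE (name = ORCHESTRATE.name)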


The UPDATE statement is as automatically generated by DataStage:

Must Do's
DataStage has many defaults, which means that it can be very easy to include DB2/UDB Enterprise stages in a job. This section specifies the minimum steps needed to get a DB2/UDB Enterprise stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic methods, and you will learn where the shortcuts are as you get familiar with the product. The steps required depend on what you are using the DB2/UDB Enterprise Stage for.

Writing a DB2 Database


In the Input Link Properties Tab:

- Choose a Write Method of Write.
- Specify the Table you are writing.
- If you are not using environment variables to specify the server and database (as described in "Accessing DB2 Databases" on page 12-3), set Use Database Environment Variable and Use Server Environment Variable to False, and supply values for the Database and Server properties.

By default the stage uses the same partitioning method as the DB2 table defined by the environment variables (see "Accessing DB2 Databases" on page 12-3). The method can be changed, or you can specify a different database, on the Input Link Partitioning Tab. Ensure column meta data has been specified for the write.
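With a Write Method of Write, the stage writes rows using INSERT statements, as noted at the start of this chapter. A minimal sketch of the kind of statement involved, assuming a hypothetical table customers with columns custid and balance:

INSERT INTO customers (custid, balance)
VALUES (ORCHESTRATE.custid, ORCHESTRATE.balance)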


Updating a DB2 Database


This is the same as writing a DB2 database, except you need to specify details of the SQL statements used to update the database: In the Input Link Properties Tab:

- Choose a Write Method of Upsert.
- Choose the Upsert Mode. This allows you to specify whether to insert and update, or update only, and whether to use a statement automatically generated by DataStage or to specify your own.
- If you have chosen an Upsert Mode of User-defined Update and Insert, specify the Insert SQL statement to use. DataStage provides the auto-generated statement as a basis, which you can edit as required.
- If you have chosen an Upsert Mode of User-defined Update and Insert or User-defined Update only, specify the Update SQL statement to use. DataStage provides the auto-generated statement as a basis, which you can edit as required.
- If you want to send rejected rows down a rejects link, set Output Rejects to True (it is False by default).

Deleting Rows from a DB2 Database


This is the same as writing a DB2 database, except you need to specify details of the SQL statements used to delete rows from the database: In the Input Link Properties Tab:

- Choose a Write Method of Delete Rows.
- Choose the Delete Rows Mode. This allows you to specify whether to use a statement automatically generated by DataStage or to specify your own.
- If you have chosen a Delete Rows Mode of User-defined delete, specify the Delete SQL statement to use. DataStage provides the auto-generated statement as a basis, which you can edit as required.
- If you want to send rejected rows down a rejects link, set Output Rejects to True (it is False by default).
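A minimal sketch of a user-defined Delete SQL statement, assuming a hypothetical table orders keyed on orderid; ORCHESTRATE.orderid refers to the stage column supplying the key value:

DELETE FROM orders
WHERE (orderid = ORCHESTRATE.orderid)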

Loading a DB2 Database


This is the default method. Loading has the same requirements as writing, except:


In the Input Link Properties Tab:

Choose a Write Method of Load.

Reading a DB2 Database


In the Output Link Properties Tab:

- Choose a Read Method. This is Table by default (which reads directly from a table and operates in parallel), but you can also choose to read using auto-generated SQL or user-generated SQL (which operates sequentially on a single node by default).
- Specify the table to be read.
- If using a Read Method of user-generated SQL, specify the SELECT SQL statement to use. DataStage provides the auto-generated statement as a basis, which you can edit as required.
- If using a Read Method apart from Table, you can specify a Partition Table property. This specifies execution of the query in parallel on the processing nodes containing a partition derived from the named table. If you do not specify this, the stage executes the query sequentially on a single node.
- If you are not using environment variables to specify the server and database (as described in "Accessing DB2 Databases" on page 12-3), set Use Database Environment Variable and Use Server Environment Variable to False, and supply values for the Database and Server properties.

Ensure column meta data has been specified for the read.
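A minimal sketch of a user-generated SELECT for a read, assuming a hypothetical table customers; the WHERE clause simply restricts the rows returned:

SELECT custid, accountType, balance
FROM customers
WHERE (balance > 1000)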

Performing a Direct Lookup on a DB2 Database Table


Connect the DB2/UDB Enterprise Stage to a Lookup stage using a reference link. In the Output Link Properties Tab:

- Set the Lookup Type to Sparse.
- Choose a Read Method. This is Table by default (which reads directly from a table), but you can also choose to read using auto-generated SQL or user-generated SQL.
- Specify the table to be read for the lookup.
- If using a Read Method of user-generated SQL, specify the SELECT SQL statement to use. DataStage provides the auto-generated statement as a basis, which you can edit as required. You would use this if, for example, you wanted to perform a non-equality based lookup.


- If you are not using environment variables to specify the server and database (as described in "Accessing DB2 Databases" on page 12-3), set Use Database Environment Variable and Use Server Environment Variable to False, and supply values for the Database and Server properties.

Ensure column meta data has been specified for the lookup.
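As an illustration of the non-equality based lookup mentioned above, the following is a sketch only; the table rate_bands and its columns lowLimit, highLimit, and rate are hypothetical, and ORCHESTRATE.balance refers to the stage column driving the lookup:

SELECT rate
FROM rate_bands
WHERE (lowLimit <= ORCHESTRATE.balance AND highLimit > ORCHESTRATE.balance)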

Performing an In Memory Lookup on a DB2 Database Table


This is the default method. It has the same requirements as a direct lookup, except: In the Output Link Properties Tab:

Set the Lookup Type to Normal.

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage.

Advanced Tab
This tab allows you to specify the following:
- Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire write is processed by the conductor node.
- Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
- Preserve partitioning. You can select Set or Clear. If you select Set, read operations will request that the next stage preserves the partitioning as is (this option does not appear if your stage only has an input link).


- Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
- Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).
Note This page is blank if you are using the stage to perform a lookup directly on a DB2 table (i.e. operating in sparse mode).

NLS Map Tab


The NLS Map tab allows you to define a character set map for the DB2/UDB Enterprise stage. This overrides the default character set map set for the project or the job. You can specify that the map be supplied as a job parameter if required.

Inputs Page
The Inputs page allows you to specify details about how the DB2/UDB Enterprise Stage writes data to a DB2 database. The DB2/UDB Enterprise Stage can have only one input link writing to one table.


The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about DB2/UDB Enterprise Stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written, and where. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes (the properties for stages in jobs being deployed on USS systems are slightly different; see page 12-28 for details). A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Target/Table | String | N/A | Y | N | N/A
Target/Delete Rows Mode | Auto-generated delete/User-defined delete | Auto-generated delete | Y if Write method = Delete Rows | N | N/A
Target/Delete SQL | String | N/A | Y if Write method = Delete Rows | N | N/A
Target/Upsert Mode | Auto-generated Update & Insert/Auto-generated Update Only/User-defined Update & Insert/User-defined Update Only | Auto-generated Update & Insert | Y if Write method = Upsert | N | N/A
Target/Insert SQL | String | N/A | Y if Write method = Upsert | N | N/A
Target/Update SQL | String | N/A | Y if Write method = Upsert | N | N/A
Target/Write Method | Delete Rows/Write/Load/Upsert | Load | Y | N | N/A
Target/Write Mode | Append/Create/Replace/Truncate | Append | Y | N | N/A
Connection/Use Default Database | True/False | True | Y | N | N/A
Connection/Use Default Server | True/False | True | Y | N | N/A
Connection/Database | string | N/A | Y (if Use Database environment variable = False) | N | N/A
Connection/Server | string | N/A | Y (if Use Server environment variable = False) | N | N/A
Connection/Client Instance Name | string | N/A | N | N | N/A
Options/Array Size | number | 2000 | Y (if Write Method = Delete) | N | N/A
Options/Output Rejects | True/False | False | Y (if Write Method = Upsert) | N | N/A
Options/Row Commit Interval | number | value of Array Size | N | N | N/A
Options/Time Commit Interval | number | 2 | N | N | N/A
Options/Silently Drop Columns Not in Table | True/False | False | Y | N | N/A
Options/Truncate Column Names | True/False | False | Y | N | N/A
Options/Truncation Length | number | 18 | N | N | Truncate Column Names
Options/Close Command | string | N/A | N | N | N/A
Options/Default String Length | number | 32 | N | N | N/A
Options/Open Command | string | N/A | N | N | N/A
Options/Use ASCII Delimited Format | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Cleanup on Failure | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Message File | pathname | N/A | N | N | N/A
Options/DB Options | string | N/A | N | N | N/A
Options/Non-recoverable Transactions | True/False | False | N | N | N/A
Options/Pad Character | string | null | N | N | N/A
Options/Exception Table | string | N/A | N | N | N/A
Options/Statistics | stats_none/stats_exttable_only/stats_extindex_only/stats_index/stats_table/stats_extindex_table/stats_all/stats_both | stats_none | N | N | N/A
Options/Number of Processes per Node | number | 1 | N | N | N/A
Options/Arbitrary Loading Order | True/False | True | N | N | Number of Processes per Node

Target Category
Table
Specify the name of the table to write to. You can specify a job parameter if required.

Delete Rows Mode
This only appears for the Delete Rows write method. Allows you to specify how the delete statement is to be derived. Choose from:
- Auto-generated Delete. DataStage generates a delete statement for you, based on the values you have supplied for table name and column details. The statement can be viewed by selecting the Delete SQL property.
- User-defined Delete. Select this to enter your own delete statement. Then select the Delete SQL property and edit the statement proforma.

Delete SQL
Only appears for the Delete Rows write method. This property allows you to view an auto-generated Delete statement, or to specify your own (depending on the setting of the Delete Rows Mode property).


Upsert Mode
This only appears for the Upsert write method. Allows you to specify how the insert and update statements are to be derived. Choose from:
- Auto-generated Update & Insert. DataStage generates update and insert statements for you, based on the values you have supplied for table name and on column details. The statements can be viewed by selecting the Insert SQL or Update SQL properties.
- Auto-generated Update Only. DataStage generates an update statement for you, based on the values you have supplied for table name and on column details. The statement can be viewed by selecting the Update SQL property.
- User-defined Update & Insert. Select this to enter your own update and insert statements. Then select the Insert SQL and Update SQL properties and edit the statement proformas.
- User-defined Update Only. Select this to enter your own update statement. Then select the Update SQL property and edit the statement proforma.

Insert SQL
Only appears for the Upsert write method. This property allows you to view an auto-generated Insert statement, or to specify your own (depending on the setting of the Upsert Mode property).

Update SQL
Only appears for the Upsert write method. This property allows you to view an auto-generated Update statement, or to specify your own (depending on the setting of the Upsert Mode property).

Write Method
Choose from Delete Rows, Write, Upsert, or Load (the default). Load takes advantage of fast DB2 loader technology for writing data to the database. Upsert uses Insert and Update SQL statements to write to the database. (Upsert is not available when you are using the DB2 load stage on a USS system.)

Write Mode
Select from the following:
- Append. This is the default. New records are appended to an existing table.


- Create. Create a new table. If the DB2 table already exists an error occurs and the job terminates. You must specify this mode if the DB2 table does not exist.
- Replace. The existing table is first dropped and an entirely new table is created in its place. DB2 uses the default partitioning method for the new table. Note that you cannot create or replace a table that has primary keys; you should not specify primary keys in your meta data.
- Truncate. The existing table attributes (including schema) and the DB2 partitioning keys are retained, but any existing records are discarded. New records are then appended to the table.

Connection Category
Use Default Server
This is set to True by default, which causes the stage to use the setting of the DB2INSTANCE environment variable to derive the server. If you set this to False, you must specify a value for the Override Server property.

Use Default Database
This is set to True by default, which causes the stage to use the setting of the environment variable APT_DBNAME, if defined, and DB2DBDFT otherwise to derive the database. If you set the property to False, you must specify a value for the Override Database property.

Server
Optionally specifies the DB2 instance name for the table. This property appears if you set the Use Server Environment Variable property to False.

Database
Optionally specifies the name of the DB2 database to access. This property appears if you set the Use Database Environment Variable property to False.

Client Instance Name
This property is only required if you are connecting to a remote DB2 server. It specifies the DB2 client through which you are making the connection (see "Remote Connection" on page 12-4).
Note Connection details are normally specified by environment variables as described in "Accessing DB2 Databases" on page 12-3. If you are specifying a remote connection, when you fill in the client instance name, user and password fields appear and allow you to specify these for connection to the remote server.

Options Category
Array Size
This is only available for Write Methods of Delete and Upsert, and is optional for Upsert. This specifies the size of the insert/delete host array. It defaults to 2000, but you can enter 1 if you want each insert/delete statement to be executed individually.

Output Rejects
This appears for the Upsert Write Method. It specifies how to handle rows that fail to be inserted. Choose True to send them down a reject link, or False to drop them.

Row Commit Interval
This is available for Write Methods of Upsert, Delete Rows, and Write. It specifies the number of records that should be committed before starting a new transaction. The specified number must be a multiple of the array size. For Upsert and Delete Rows, the default is the array size (which in turn defaults to 2000). For Write the default is 2000. If you set a small value for Row Commit Interval, you force DB2 to perform frequent commits. Therefore, if your program terminates unexpectedly, your data set can still contain partial results that you can use. However, you may pay a performance penalty because of the high frequency of the commits. If you set a large value for Row Commit Interval, DB2 must log a correspondingly large amount of rollback information. This, too, may slow your application.

Time Commit Interval
This is available for Write Methods of Upsert and Delete. It specifies the number of seconds DataStage should allow between committing the input array and starting a new transaction. The default time period is 2 seconds.

Silently Drop Columns Not in Table
This is False by default. Set it to True to silently drop all input columns that do not correspond to columns in an existing DB2 table. Otherwise the stage reports an error and terminates the job.


Truncate Column Names
Select this option to truncate column names to 18 characters. To specify a length other than 18, use the Truncation Length dependent property:
- Truncation Length. This is set to 18 by default. Change it to specify a different truncation length.

Close Command
This is an optional property. Use it to specify any command to be parsed and executed by the DB2 database on all processing nodes after the stage finishes processing the DB2 table. You can specify a job parameter if required.

Default String Length
This is an optional property and is set to 32 by default. Sets the default string length of variable-length strings written to a DB2 table. Variable-length strings longer than the set length cause an error. The maximum length you can set is 4000 bytes. Note that the stage always allocates the specified number of bytes for a variable-length string. In this case, setting a value of 4000 allocates 4000 bytes for every string. Therefore, you should set the expected maximum length of your largest string and no larger.

Open Command
This is an optional property. Use it to specify any command to be parsed and executed by the DB2 database on all processing nodes before the DB2 table is opened. You can specify a job parameter if required.

Use ASCII Delimited Format
This property only appears if Write Method is set to Load. Specify this option to configure DB2 to use the ASCII-delimited format for loading binary numeric data instead of the default ASCII-fixed format. This option can be useful when you have variable-length columns, because the database will not have to allocate the maximum amount of storage for each variable-length column. However, all numeric columns are converted to an ASCII format by DB2, which is a CPU-intensive operation. See the DB2 reference manuals for more information.
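As an illustration of the Open Command and Close Command properties described above, the statements below are a sketch only; the tables sales and load_audit are hypothetical, and any statements that DB2 can parse and execute could be used instead:

-- hypothetical Open Command: lock the target table before processing starts
LOCK TABLE sales IN EXCLUSIVE MODE
-- hypothetical Close Command: record that the table has been processed
INSERT INTO load_audit (table_name, loaded_at) VALUES ('sales', CURRENT TIMESTAMP)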


Cleanup on Failure
This property only appears if Write Method is set to Load. Specify this option to deal with failures during stage execution that leave the tablespace being loaded in an inaccessible state. The cleanup procedure neither inserts data into the table nor deletes data from it. You must delete rows that were inserted by the failed execution either through the DB2 command-level interpreter or by using the stage subsequently using the replace or truncate write modes.

Message File
This property only appears if Write Method is set to Load. Specifies the file where the DB2 loader writes diagnostic messages. The database instance must have read/write privilege to the file.

DB Options
This only appears if Write Method is set to Load and Write Mode is set to Create or Replace. It specifies an optional table space or partitioning key to be used by DB2 to create the table. By default, DataStage creates the table on all processing nodes in the default table space and uses the first column in the table, corresponding to the first field in the input data set, as the partitioning key. You specify arguments as a string enclosed in braces in the form:
{tablespace=t_space,[key=col0,...]}

Non-recoverable Transactions
This only appears if Write Method is set to Load. It is False by default. If set to True, it indicates that your load transaction is marked as non-recoverable. It will not be possible to recover your transaction with a subsequent roll forward action. The roll forward utility will skip the transaction, and will mark the table into which data was being loaded as "invalid". The utility will also ignore any subsequent transactions against the table. After a roll forward is completed, the table can only be dropped. Table spaces are not put in a backup pending state following the load operation, and a copy of the loaded data is not made during the load operation.

Pad Character
This appears for a Write Method of Upsert or Delete Rows. It specifies the padding character to be used in the construction of a WHERE clause when it contains string columns that have a length less than the DB2 char column in the database. It defaults to null. (See "Using the Pad Character Property" on page 12-7.)

Exception Table
This property only appears if Write Method is set to Load. It allows you to specify the name of a table where rows that violate load table constraints are inserted. The table needs to have been created in the DB2 database. The exception table cannot be used when the Write Mode is set to Create or Replace.

Statistics
This property only appears if Write Method is set to Load. It allows you to specify which statistics should be generated upon load completion; as part of the loading process DB2 will collect the requisite statistics for table optimization. This option is only valid for a Write Mode of Truncate; it is ignored otherwise.

Number of Processes per Node
This property only appears if Write Method is set to Load. It allows you to specify the number of processes to initiate on every node. If set to 0, the stage uses its own algorithm to determine the optimal number, based on the number of CPUs available at runtime (this does not, however, take into account the workload from the rest of the job). By default it is set to 1. It has the following dependent property:
- Arbitrary Loading Order. This only appears if Number of Processes per Node is set to a value greater than 1. If set to True, it specifies that the loading of every node can be arbitrary, leading to a potential performance gain.

USS Options
If you are designing jobs within a USS deployment project (see Chapter 56, "Parallel Jobs on USS"), the properties available under the Connection and Options categories are different, and there is an extra category: MVS Datasets. The following table describes the properties available for these categories; see page 12-22 for the properties available under the Target category.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Connection/Use Default Database | True/False | True | Y | N | N/A
Connection/Database | string | N/A | Y (if Use Database environment variable = False) | N | N/A
Options/Enforce Constraints | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Keep Dictionary | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Preformat | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Silently Drop Columns Not in Table | True/False | False | Y (if Write Method = Load or Write) | N | N/A
Options/Truncate Column Names | True/False | False | Y (if Write Method = Load or Write) | N | N/A
Options/Truncation Length | number | 18 | N | N | Truncate Column Names
Options/Verbose | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Close Command | string | N/A | N | N | N/A
Options/Default String Length | number | 32 | N | N | N/A
Options/Exception Table | string | N/A | N | N | N/A
Options/Number of Processes per Node | number | 1 | N | N | N/A
Options/Arbitrary Loading Order | True/False | True | N | N | Number of Processes per Node
Options/Open Command | string | N/A | N | N | N/A
Options/Row Estimate | integer | N/A | N | N | N/A
Options/Sort Device Type | string | N/A | N | N | N/A
Options/Sort Keys | integer | N/A | N | N | N/A
Options/When Clause | string | N/A | N | N | N/A
Options/Create Statement | True/False | False | Y (if Write Method = Load and Write Mode = Create) | N | N/A
Options/DB Options | string | N/A | N | N | N/A
Options/Reuse Datasets | True/False | False | Y (if Write Method = Load and Write Mode = Replace) | N | N/A
Options/Statistics | stats_all/stats_both/stats_extindex_only/stats_extindex_table/stats_exttable_only/stats_index/stats_none/stats_table | stats_none | N | N | N/A
Options/Array Size | number | 2000 | Y (if Write Method = Delete) | N | N/A
Options/Pad Character | string | null | N | N | N/A
Options/Row Commit Interval | number | value of Array Size | N | N | N/A
Options/Time Commit Interval | number | 2 | N | N | N/A
Options/Output Rejects | True/False | False | Y (if Write Method = Upsert) | N | N/A

Connection Category
Use Default Database
This is set to True by default, which causes the stage to use the default DB2 subsystem. If you set the property to False, you must specify a value for the Override Database property.


Database
Optionally specifies the name of the DB2 database to access. This property appears if you set the Use Database Environment Variable property to False.

MVS DataSets Category


Discard DSN
Specifies the name of the MVS dataset that stores the rejected records. It has the following sub-properties:
- Discard Device Type. The device type that is used for the specified discard dataset.
- Discard Space. The primary allocation space for the discard dataset, specified in cylinders.
- Max Discards Per Node. An integer which specifies the maximum number of discarded rows to keep in a dataset per node.

Error DSN
The name of the MVS dataset that stores rows that could not be loaded into DB2 because of an error. It has the following sub-properties:
- Error Device Type. The device type that is used for the specified error dataset.
- Error Space. The primary allocation space for the error dataset, specified in cylinders.

Map DSN
Specifies the name of the MVS dataset for mapping identifiers back to the input records that caused an error. It has the following sub-properties:
- Map Device Type. The device type that is used for the specified map dataset.


- Map Space. The primary allocation space for the map dataset, specified in cylinders.

Work 1 DSN
Specifies the name of the MVS dataset for sorting input. It has the following sub-properties:
- Work 1 Device Type. The device type that is used for the specified Work 1 dataset.
- Work 1 Space. The primary allocation space for the Work 1 dataset, specified in cylinders.

Work 2 DSN
Specifies the name of the MVS dataset for sorting output. It has the following sub-properties:
- Work 2 Device Type. The device type that is used for the specified Work 2 dataset.
- Work 2 Space. The primary allocation space for the Work 2 dataset, specified in cylinders.

Options Category
Enforce Constraints
Only available when Write Method = Load. If this is set to True, load will delete errant rows when encountering them, and issue a message identifying each such row. This requires that:
- referential constraints exist
- the input must be sorted
- a Map DSN dataset must be specified under the MVS Datasets category.

Keep Dictionary
Only available when Write Method = Load. If this is set to True, load is prevented from building a new compression dictionary. This property is ignored unless the associated tablespace has the COMPRESS YES attribute.

Preformat
Only available when Write Method = Load. If set to True, the remaining pages are preformatted in the tablespace and its index space.

Silently Drop Columns Not in Table
This is False by default. Set it to True to silently drop all input columns that do not correspond to columns in an existing DB2 table. Otherwise the stage reports an error and terminates the job.

Truncate Column Names
Select this option to truncate column names to 18 characters. To specify a length other than 18, use the Truncation Length dependent property:
- Truncation Length. This is set to 18 by default. Change it to specify a different truncation length.

Verbose
Only available when Write Method = Load. If this is set to True, DataStage logs all messages generated by DB2 when a record is rejected because of primary key or other violations.

Close Command
This is an optional property. Use it to specify any command to be parsed and executed by the DB2 database on all processing nodes after the stage finishes processing the DB2 table. You can specify a job parameter if required.

Default String Length
This is an optional property and is set to 32 by default. Sets the default string length of variable-length strings written to a DB2 table. Variable-length strings longer than the set length cause an error. The maximum length you can set is 4000 bytes. Note that the stage always allocates the specified number of bytes for a variable-length string. In this case, setting a value of 4000 allocates 4000 bytes for every string. Therefore, you should set the expected maximum length of your largest string and no larger.


Exception Table
This property only appears if Write Method is set to Load. It allows you to specify the name of a table where rows that violate load table constraints are inserted. The table needs to have been created in the DB2 database. The exception table cannot be used when the Write Mode is set to Create or Replace.

Number of Processes per Node
This property only appears if Write Method is set to Load. It allows you to specify the number of processes to initiate on every node. If set to 0, the stage uses its own algorithm to determine the optimal number, based on the number of CPUs available at runtime (this does not, however, take into account the workload from the rest of the job). By default it is set to 1. It has the following dependent property:
- Arbitrary Loading Order. This only appears if Number of Processes per Node is set to a value greater than 1. If set to True, it specifies that the loading of every node can be arbitrary, leading to a potential performance gain.

Open Command
This is an optional property. Use it to specify any command to be parsed and executed by the DB2 database on all processing nodes before the DB2 table is opened. You can specify a job parameter if required.

Row Estimate
Only available when Write Method = Load. Specify the estimated number of rows (across all nodes) to be loaded into the database. An estimate of the required primary allocation space for storing all rows is made before load is engaged.

Sort Device Type
Only available when Write Method = Load. Specify the device type for dynamically allocated datasets used by DFSORT.

Sort Keys
Only available when Write Method = Load. Set this to have rows presorted according to keys; the value is an estimate of the number of index keys to be sorted. Do not use this property if the tablespace does not have an index, has only one index, or the data is already sorted according to index keys.


When Clause
Only available when Write Method = Load. Specify a WHEN clause for the load script.

Create Statement
Only available when Write Method = Load and Write Mode = Create or Replace. Specify the SQL statement to create the table.

DB Options
This only appears if Write Method is set to Load and Write Mode is set to Create or Replace. It specifies an optional table space or partitioning key to be used by DB2 to create the table. By default, DataStage creates the table on all processing nodes in the default table space and uses the first column in the table, corresponding to the first field in the input data set, as the partitioning key. You specify arguments as a string enclosed in braces in the form:
{tablespace=t_space,[key=col0,...]}
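As an illustration of the Create Statement property described above, a minimal sketch; the table horse_health and its columns are hypothetical:

CREATE TABLE horse_health (name VARCHAR(30) NOT NULL, wormDate DATE, wormProduct VARCHAR(20))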

Reuse Datasets
This only appears if Write Method is set to Load and Write Mode is set to Replace. If True, DB2 reuses DB2 managed datasets without relocating them.

Statistics
This only appears if Write Method is set to Load and Write Mode is set to Truncate. Specifies which statistics should be generated upon completion of the load. As a part of the loading process, DB2 collects the statistics required for table access optimization (alternatively use the RUNSTAT utility).

Array Size
This is only available for Write Methods of Delete and Upsert, and is optional for Upsert. This specifies the size of the insert/delete host array. It defaults to 2000, but you can enter 1 if you want each insert/delete statement to be executed individually.

Pad Character
This appears for a Write Method of Upsert or Delete Rows. It specifies the padding character to be used in the construction of a WHERE clause when it contains string columns that have a length less than the DB2 char column in the database. It defaults to null. (See "Using the Pad Character Property" on page 12-7.)

Row Commit Interval
This is available for Write Methods of Upsert, Delete Rows, and Write. It specifies the number of records that should be committed before starting a new transaction. The specified number must be a multiple of the array size. For Upsert and Delete Rows, the default is the array size (which in turn defaults to 2000). For Write the default is 2000. If you set a small value for Row Commit Interval, you force DB2 to perform frequent commits. Therefore, if your program terminates unexpectedly, your data set can still contain partial results that you can use. However, you may pay a performance penalty because of the high frequency of the commits. If you set a large value for Row Commit Interval, DB2 must log a correspondingly large amount of rollback information. This, too, may slow your application.

Time Commit Interval
This is available for Write Methods of Upsert and Delete. It specifies the number of seconds DataStage should allow between committing the input array and starting a new transaction. The default time period is 2 seconds.

Output Rejects
This appears for the Upsert Write Method. It specifies how to handle rows that fail to be inserted. Choose True to send them down a reject link, or False to drop them.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the DB2 database. It also allows you to specify that the data should be sorted before being written. By default the stage partitions in DB2 mode. This takes the partitioning method from a selected DB2 database (or the one specified by the environment variables described in "Accessing DB2 Databases" on page 12-3). If the DB2/UDB Enterprise Stage is operating in sequential mode, it will first collect the data before writing it to the database using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

- Whether the DB2/UDB Enterprise Stage is set to execute in parallel or sequential mode.
- Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the DB2/UDB Enterprise Stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the DB2/UDB Enterprise Stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default Auto collection method.
The following partitioning methods are available:
- Entire. Each file written to receives the entire data set.
- Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
- Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns.
- Random. The records are partitioned randomly, based on the output of a random number generator.
- Round Robin. The records are partitioned on a round robin basis as they enter the stage.
- Same. Preserves the partitioning already in place.
- DB2. Replicates the DB2 partitioning method of the specified DB2 table. This is the default method for the DB2/UDB Enterprise Stage.
- Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
- (Auto). This is the default collection method for DB2/UDB Enterprise Stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
- Ordered. Reads all records from the first partition, then all records from the second partition, and so on.


- Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
- Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the file or files. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows:
- Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
- Stable. Select this if you want to preserve previously sorted data sets. This is the default.
- Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about how the DB2/UDB Enterprise Stage reads data from a DB2 database. The DB2/UDB Enterprise Stage can have only one output link. Alternatively it can have a reference output link, which is used by the Lookup stage when referring to a DB2 lookup table. It can also have a reject link where rejected records are routed (used in conjunction with an input link). The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. Details about DB2/UDB Enterprise Stage properties are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Output Link Properties Tab


The Properties tab allows you to specify properties for the output link. These dictate how data is read, and from which table. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The Build SQL button allows you to instantly open the SQL Builder to help you construct an SQL query to read data. See Chapter 59, "SQL Builder," for guidance on using it. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Source/Lookup Type | Normal/Sparse | Normal | Y (if output is reference link connected to Lookup stage) | N | N/A
Source/Read Method | Table/Auto-generated SQL/User-defined SQL | Table | Y | N | N/A
Source/Table | string | N/A | Y (if Read Method = Table) | N | N/A
Source/Where clause | string | N/A | N | N | Table
Source/Select List | string | N/A | N | N | Table
Source/Query | string | N/A | Y (if Read Method = Query) | N | N/A
Source/Partition Table | string | N/A | N | N | Query
Connection/Use Default Database | True/False | True | Y | N | N/A
Connection/Use Default Server | True/False | True | Y | N | N/A
Connection/Database | string | N/A | Y (if Use Database environment variable = False) | N | N/A
Connection/Server | string | N/A | Y (if Use Server environment variable = False) | N | N/A
Connection/Client Instance Name | string | N/A | N | N | N/A
Options/Close Command | string | N/A | N | N | N/A
Options/Open Command | string | N/A | N | N | N/A

Source Category
Lookup Type
Where the DB2/UDB Enterprise Stage is connected to a Lookup stage via a reference link, this property specifies whether the DB2/UDB Enterprise Stage will provide data for an in-memory look up (Lookup Type = Normal) or whether the lookup will access the database directly (Lookup Type = Sparse). If the Lookup Type is Normal, the Lookup stage can have multiple reference links. If the Lookup Type is Sparse, the Lookup stage can only have one reference link.
Read Method
This property specifies whether you are specifying a table or a query when reading the DB2/UDB database, and how you are generating the query:
Select the Table method in order to use the Table property to specify the read. This will read in parallel.
Select Auto-generated SQL to have DataStage automatically generate an SQL query based on the columns you have defined and the table you specify in the Table property.
Select User-defined SQL to define your own query.
Select SQL Builder Generated SQL to open the SQL Builder and define the query using its helpful interface (see Chapter 59, "SQL Builder.")

By default, Read methods of SQL Builder Generated SQL, Auto-generated SQL, and User-defined SQL operate sequentially on a single node. You can have the User-defined SQL read operate in parallel if you specify the Partition Table property.
Query
This property is used to contain the SQL query when you choose a Read Method of User-defined query or Auto-generated SQL. If you are using Auto-generated SQL you must select a table and specify some column definitions. An SQL statement can contain joins, views, database links, synonyms, and so on. It has the following dependent option:
Partition Table
Specifies execution of the query in parallel on the processing nodes containing a partition derived from the named table. If you do not specify this, the stage executes the query sequentially on a single node.
Table
Specifies the name of the DB2 table. The table must exist and you must have SELECT privileges on the table. If your DB2 user name does not correspond to the owner of the specified table, you can prefix it with a table owner in the form:
table_owner.table_name

If you use a Read method of Table, then the Table property has two dependent properties:
Where clause
Allows you to specify a WHERE clause of the SELECT statement to specify the rows of the table to include or exclude from the read operation. If you do not supply a WHERE clause, all rows are read.
Select List
Allows you to specify an SQL select list of column names.
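For example, a Read Method of Table with a Table of accounts, a Select List of customer, balance, and a Where clause of balance > 1000 reads the table with a statement equivalent to the following (the table and column names here are purely illustrative):

SELECT customer, balance FROM accounts WHERE balance > 1000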

Connection Category
Use Default Server
This is set to True by default, which causes the stage to use the setting of the DB2INSTANCE environment variable to derive the server. If you set this to False, you must specify a value for the Override Server property. (This does not appear if you are developing a job for deployment on a USS system).

Use Default Database
This is set to True by default, which causes the stage to use the setting of the environment variable APT_DBNAME, if defined, and DB2DBDFT otherwise to derive the database. For USS systems, True causes the default DB2 sub-system to be used. If you set the property to False, you must specify a value for the Override Database property.
Server
Optionally specifies the DB2 instance name for the table. This property appears if you set Use Server Environment Variable property to False. (This does not appear if you are developing a job for deployment on a USS system).
Database
Optionally specifies the name of the DB2 database to access. This property appears if you set Use Database Environment Variable property to False.
Client Instance Name
This property is only required if you are connecting to a remote DB2 server. It specifies the DB2 client through which you are making the connection (see "Remote Connection" on page 12-4). (This does not appear if you are developing a job for deployment on a USS system).
Note Connection details are normally specified by environment variables as described in "Accessing DB2 Databases" on page 12-3. If you are specifying a remote connection, when you fill in the client instance name, user and password fields appear and allow you to specify these for connection to the remote server.

Options Category
Close Command
This is an optional property. Use it to specify a command to be parsed and executed by the DB2 database on all processing nodes after the stage finishes processing the DB2 table. You can specify a job parameter if required.
Open Command
This is an optional property. Use it to specify a command to be parsed and executed by the DB2 database on all processing nodes before the DB2 table is opened. You can specify a job parameter if required.
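For example, you might use the Open Command to lock the table for the duration of the processing (whether locking is appropriate depends on your database setup; the table name here is purely illustrative):

LOCK TABLE accounts IN SHARE MODE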


Pad Character
This appears when you are using a DB2 table as a lookup (i.e. have a Lookup Type of Sparse). It specifies the padding character to be used in the construction of a WHERE clause when it contains string columns that have a length less than the DB2 char column in the database. It defaults to null. (See "Using the Pad Character Property" on page 12-7.)


13
Oracle Enterprise Stage
The Oracle Enterprise Stage is a database stage. It allows you to read data from and write data to an Oracle database. It can also be used in conjunction with a Lookup stage to access a lookup table hosted by an Oracle database (see Chapter 20, "Merge Stage.") The Oracle Enterprise Stage can have a single input link and a single reject link, or a single output link or output reference link. The stage performs one of the following operations:
Updates an Oracle table using INSERT and/or UPDATE as appropriate. Data is assembled into arrays and written using Oracle host-array processing.
Loads an Oracle table (using Oracle fast loader).
Reads an Oracle table.
Deletes rows from an Oracle table.
Performs a lookup directly on an Oracle table.
Loads an Oracle table into memory and then performs a lookup on it.
When using an Oracle stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, note that the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup (you could, for example, use a Modify stage to do this). See Chapter 20, "Merge Stage," for more details about performing lookups.


When you edit an Oracle Enterprise Stage, the Oracle Enterprise Stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors." The stage editor has up to three pages, depending on whether you are reading or writing a database:
Stage Page. This is always present and is used to specify general information about the stage.
Inputs Page. This is present when you are writing to an Oracle database. This is where you specify details about the data being written.
Outputs Page. This is present when you are reading from an Oracle database, or performing a lookup on an Oracle database. This is where you specify details about the data being read.

Accessing Oracle Databases


You need to be running Oracle 8 or better, Enterprise Edition in order to use the Oracle Enterprise Stage. You must also do the following:
1 Create the user defined environment variable ORACLE_HOME and set this to the $ORACLE_HOME path (e.g., /disk3/oracle9i).
2 Create the user defined environment variable ORACLE_SID and set this to the correct service name (e.g., ODBCSOL).
3 Add ORACLE_HOME/bin to your PATH and ORACLE_HOME/lib to your LIBPATH, LD_LIBRARY_PATH, or SHLIB_PATH.
4 Have login privileges to Oracle using a valid Oracle user name and corresponding password. These must be recognized by Oracle before you attempt to access it.
5 Have SELECT privilege on:
DBA_EXTENTS
DBA_DATA_FILES
DBA_TAB_PARTITIONS
DBA_TAB_SUBPARTITIONS
DBA_OBJECTS
ALL_PART_INDEXES
ALL_PART_TABLES
ALL_INDEXES
SYS.GV_$INSTANCE (Only if Oracle Parallel Server is used)

Note APT_ORCHHOME/bin must appear before ORACLE_HOME/bin in your PATH.

We suggest that you create a role that has the appropriate SELECT privileges, as follows:
CREATE ROLE DSXE;
GRANT SELECT on sys.dba_extents to DSXE;
GRANT SELECT on sys.dba_data_files to DSXE;
GRANT SELECT on sys.dba_tab_partitions to DSXE;
GRANT SELECT on sys.dba_tab_subpartitions to DSXE;
GRANT SELECT on sys.dba_objects to DSXE;
GRANT SELECT on sys.all_part_indexes to DSXE;
GRANT SELECT on sys.all_part_tables to DSXE;
GRANT SELECT on sys.all_indexes to DSXE;


Once the role is created, grant it to users who will run DataStage jobs, as follows:
GRANT DSXE to <oracle userid>;

Handling Special Characters (# and $)


The characters # and $ are reserved in DataStage and special steps are needed to handle Oracle databases which use the characters # and $ in column names. DataStage converts these characters into an internal format, then converts them back as necessary. To take advantage of this facility, you need to do the following:
In DataStage Administrator, open the Environment Variables dialog for the project in question, and set the environment variable DS_ENABLE_RESERVED_CHAR_CONVERT to true (this can be found in the General\Customize branch).

Avoid using the strings __035__ and __036__ in your Oracle column names (these are used as the internal representations of # and $ respectively). When using this feature in your job, you should import meta data using the Plug-in Meta Data Import tool, and avoid hand-editing (this minimizes the risk of mistakes or confusion). Once the table definition is loaded, the internal column names are displayed rather than the original Oracle names both in table definitions and in the Data Browser. They are also used in derivations and expressions. The original names are used in generated SQL statements, however, and you should use them if entering SQL in the job yourself. Generally, in the Oracle stage, you enter external names everywhere except when referring to stage column names, where you use names in the form ORCHESTRATE.internal_name.


When using the Oracle stage as a target, you should enter external names as follows:
For Load options, use external names for select list properties.
For Upsert option, for update and insert, use external names when referring to Oracle table column names, and internal names when referring to the stage column names. For example:
INSERT INTO tablename (A#, B$#) VALUES (ORCHESTRATE.A__036__A__035__, ORCHESTRATE.B__035__035__B__036__)
UPDATE tablename SET B$# = ORCHESTRATE.B__035__035__B__036__ WHERE (A# = ORCHESTRATE.A__036__A__035__)

When using the Oracle stage as a source, you should enter external names as follows: For Read using the user-defined SQL method, use external names for Oracle columns for SELECT: For example:
SELECT M#$, D#$ FROM tablename WHERE (M#$ > 5)

For Read using Table method, use external names in select list and where properties. When using the Oracle stage in parallel jobs as a look-up, you should enter external or internal names as follows: For Lookups using the user-defined SQL method, use external names for Oracle columns for SELECT, and for Oracle columns in any WHERE clause you might add. Use internal names when referring to the stage column names in the WHERE clause. For example:
SELECT M$##, D#$ FROM tablename WHERE (B$# = ORCHESTRATE.B__035__B__036__).

For Lookups using the Table method, use external names in select list and where properties. Use internal names for the key option on the Inputs page Properties tab of the Lookup stage to which the Oracle stage is attached.

Loading Tables
There are some special points to note when using the Load method in this stage (which uses the Oracle fast loader) to load tables with indexes. By default, the stage sets the following options in the Oracle load control file:
DIRECT=TRUE
PARALLEL=TRUE

This causes the load to run using parallel direct load mode. In order to use the parallel direct mode load, the table must not have indexes, or you must include one of the Index Mode properties, 'rebuild' or 'maintenance' (see page 13-24). If the only index on the table is from a primary key or unique key constraint, you can instead use the Disable Constraints property (see page 13-23) which will disable the primary key or unique key constraint, and enable it again after the load. If you set the Index Mode property to rebuild, the following options are set in the file:
SKIP_INDEX_MAINTENANCE=YES
PARALLEL=TRUE
If you set the Index Mode property to maintenance, the following option is set in the file:
PARALLEL=FALSE
You can use the environment variable APT_ORACLE_LOAD_OPTIONS to control the options that are included in the Oracle load control file. You can load a table with indexes without using the Index Mode or Disable Constraints properties by setting the APT_ORACLE_LOAD_OPTIONS environment variable appropriately. You need to set the Direct option and/or the PARALLEL option to FALSE, for example:
APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=FALSE,PARALLEL=TRUE)'

In this example the stage would still run in parallel, however, since DIRECT is set to FALSE, the conventional path mode rather than the direct path mode would be used. If APT_ORACLE_LOAD_OPTIONS is used to set PARALLEL to FALSE, then you must set the execution mode of the stage to run sequentially on the Advanced tab of the Stage page (see page 13-15). If loading index organized tables (IOTs), you should not set both DIRECT and PARALLEL to true as direct parallel path load is not allowed for IOTs.

Type Conversions - Writing to Oracle


When writing or loading, the Oracle Enterprise stage automatically converts DataStage data types to Oracle data types as shown in the following table:
DataStage SQL Data Type | Underlying Data Type | Oracle Data Type
Date | date | DATE
Time | time | DATE (does not support microsecond resolution)
Timestamp | timestamp | DATE (does not support microsecond resolution)
Decimal, Numeric | decimal (p, s) | NUMBER (p, s)
TinyInt | int8/uint8 | NUMBER (3, 0)
SmallInt | int16/uint16 | NUMBER (3, 0)
Integer | int32/uint32 | NUMBER (10, 0)
BigInt | int64 | NUMBER (19, 0)
BigInt | uint64 | NUMBER (20, 0)
Float, Real | sfloat | NUMBER
Double | dfloat | NUMBER
Binary, Bit, LongVarBinary, VarBinary | raw | not supported
Unknown, Char, LongVarChar, VarChar | fixed-length string in the form string[n] and ustring[n]; length <= 255 bytes | CHAR(n) where n is the string length
LongVarChar, VarChar | variable-length string, in the form string[max=n] and ustring[max=n]; maximum length <= 2096 bytes | VARCHAR(n) where n is the maximum string length
LongVarChar, VarChar | variable-length string in the form string and ustring | VARCHAR(32)*
* The default length of VARCHAR is 32 bytes. That is, 32 bytes are allocated for each variable-length string field in the input data set. If an input variable-length string field is longer than 32 bytes, the stage issues a warning.


Type Conversions - Reading from Oracle


When reading, the Oracle Enterprise stage automatically converts Oracle data types to DataStage data types as shown in the following table:
DataStage SQL Data Type | Underlying Data Type | Oracle Data Type
Unknown, Char, LongVarChar, VarChar, NChar, NVarChar, LongNVarChar | string[n] or ustring[n]; fixed length string with length = n | CHAR(n)
Unknown, Char, LongVarChar, VarChar, NChar, NVarChar, LongNVarChar | string[max = n] or ustring[max = n]; variable length string with length = n | VARCHAR(n)
Timestamp | timestamp | DATE
Decimal, Numeric | decimal (38,10) | NUMBER
Integer, Decimal, Numeric | int32 if precision (p) < 11 and scale (s) = 0; decimal[p, s] if precision (p) => 11 and scale (s) > 0 | NUMBER (p, s)
not supported | not supported | RAW(n)

Examples
Looking Up an Oracle Table
This example shows what happens when data is looked up in an Oracle table. The stage in this case will look up the interest rate for each customer based on the account type. Here is the data that arrives on the primary link:
Customer | accountNo | accountType | balance
Latimer | 7125678 | plat | 7890.76
Ridley | 7238892 | flexi | 234.88
Cranmer | 7611236 | gold | 1288.00
Hooper | 7176672 | flexi | 3456.99
Moore | 7146789 | gold | 424.76

Here is the data in the Oracle lookup table:

accountType | InterestRate
bronze | 1.25
silver | 1.50
gold | 1.75
plat | 2.00
flexi | 1.88
fixterm | 3.00

Here is what the lookup stage will output:

Customer | accountNo | accountType | balance | InterestRate
Latimer | 7125678 | plat | 7890.76 | 2.00
Ridley | 7238892 | flexi | 234.88 | 1.88
Cranmer | 7611236 | gold | 1288.00 | 1.75
Hooper | 7176672 | flexi | 3456.99 | 1.88
Moore | 7146789 | gold | 424.76 | 1.75

The job looks like the one illustrated on page 13-2. The Data_set stage provides the primary input, Oracle_8 provides the lookup data, Lookup_1 performs the lookup and outputs the resulting data to Data_Set_3. In the Oracle stage we specify that we are going to look up the data directly in the Oracle database, and the name of the table we are going to look up. In the Lookup stage we specify the column that we are using as the key for the look up.


The properties for the Oracle stage are as follows:

The properties for the Lookup stage are as follows:

Updating an Oracle Table


This example shows an Oracle table being updated with three new columns. The database records the horse health records of a large stud. Details of the worming records are being added to the main table and populated with the most recent data, using the existing column name as a key. The meta data for the new columns is as follows:

We are going to specify upsert as the write method and choose User-defined Update & Insert as the upsert mode; this is so that we do not include the existing name column in the INSERT statement. The properties (showing the INSERT statement) are shown below. The INSERT statement is as generated by DataStage, except the name column is removed.
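Purely as an illustration (the table and worming column names below are hypothetical), an INSERT of this kind, with the name column omitted, takes the general form:

INSERT INTO horse_health (worm_date, worm_product, worm_dose) VALUES (ORCHESTRATE.worm_date, ORCHESTRATE.worm_product, ORCHESTRATE.worm_dose)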


The UPDATE statement is as automatically generated by DataStage:
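Using the same hypothetical names, an auto-generated UPDATE keyed on the name column takes the general form:

UPDATE horse_health SET worm_date = ORCHESTRATE.worm_date, worm_product = ORCHESTRATE.worm_product, worm_dose = ORCHESTRATE.worm_dose WHERE (name = ORCHESTRATE.name)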

Must Dos
DataStage has many defaults, which means that it can be very easy to include Oracle Enterprise Stages in a job. This section specifies the minimum steps to take to get an Oracle Enterprise Stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are when you get familiar with the product. The steps required depend on what you are using an Oracle Enterprise Stage for.

Updating an Oracle Database


In the Input Link Properties Tab, under the Target category specify the update method as follows:

Specify a Write Method of Upsert.
Specify the Table you are writing.
Choose the Upsert Mode. This allows you to specify whether to insert and update, or update only, and whether to use a statement automatically generated by DataStage or specify your own.
If you have chosen an Upsert Mode of User-defined Update and Insert, specify the Insert SQL statement to use. DataStage provides the auto-generated statement as a basis, which you can edit as required.
If you have chosen an Upsert Mode of User-defined Update and Insert or User-defined Update only, specify the Update SQL statement to use. DataStage provides the auto-generated statement as a basis, which you can edit as required.


Under the Connection category, you can either manually specify a connection string, or have DataStage generate one for you using a user name and password you supply. Either way you need to supply a valid username and password. DataStage encrypts the password when you use the auto-generate option. By default, DataStage assumes Oracle resides on the local server, but you can specify a remote server if required. Under the Options category:

If you want to send rejected rows down a rejects link, set Output Rejects to True (it is false by default).

Ensure column meta data has been specified for the write.

Deleting Rows from an Oracle Database


This is the same as writing an Oracle database, except you need to specify details of the SQL statements used to delete rows from the database: In the Input Link Properties Tab:

Choose a Write Method of Delete Rows.
Choose the Delete Rows Mode. This allows you to specify whether to use a statement automatically generated by DataStage or specify your own.
If you have chosen a Delete Rows Mode of User-defined delete, specify the Delete SQL statement to use. DataStage provides the auto-generated statement as a basis, which you can edit as required.
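As an illustration only (the table and key column names are hypothetical), an auto-generated delete statement takes a form similar to:

DELETE FROM accounts WHERE (accountNo = ORCHESTRATE.accountNo)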

Loading an Oracle Database


This is the default write method. In the Input Link Properties Tab, under the Target category:

Specify a Write Method of Load.
Specify the Table you are writing.
Specify the Write Mode (by default DataStage appends to existing tables; you can also choose to create a new table, replace an existing table, or keep existing table details but replace all the rows).

Under the Connection category, you can either manually specify a connection string, or have DataStage generate one for you using a user name and password you supply. Either way you need to


supply a valid username and password. DataStage encrypts the password when you use the auto-generate option. By default, DataStage assumes Oracle resides on the local server, but you can specify a remote server if required. Ensure column meta data has been specified for the write.

Reading an Oracle Database


In the Output Link Properties Tab:

Choose a Read Method. This is Table by default, but you can also choose to read using auto-generated SQL or user-generated SQL. The read operates sequentially on a single node unless you specify a Partition Table property (which causes parallel execution on the processing nodes containing a partition derived from the named table).
Specify the table to be read.
If using a Read Method of user-generated SQL, specify the SELECT SQL statement to use. DataStage provides the auto-generated statement as a basis, which you can edit as required.

Under the Connection category, you can either manually specify a connection string, or have DataStage generate one for you using a user name and password you supply. Either way you need to supply a valid username and password. DataStage encrypts the password when you use the auto-generate option. By default, DataStage assumes Oracle resides on the local server, but you can specify a remote server if required. Ensure column meta data has been specified for the read.

Performing a Direct Lookup on an Oracle Database Table


Connect the Oracle Enterprise Stage to a Lookup stage using a reference link. In the Output Link Properties Tab:

Set the Lookup Type to Sparse.
Choose a Read Method. This is Table by default (which reads directly from a table), but you can also choose to read using auto-generated SQL or user-generated SQL.
Specify the table to be read for the lookup.


If using a Read Method of user-generated SQL, specify the SELECT SQL statement to use. DataStage provides the auto-generated statement as a basis, which you can edit as required. You would use this if, for example, you wanted to perform a non-equality based lookup.

Under the Connection category, you can either manually specify a connection string, or have DataStage generate one for you using a user name and password you supply. Either way you need to supply a valid username and password. DataStage encrypts the password when you use the auto-generate option. By default, DataStage assumes Oracle resides on the local server, but you can specify a remote server if required. Ensure column meta data has been specified for the lookup.

Performing an In Memory Lookup on an Oracle Database Table


This is the default method. It has the same requirements as a direct lookup, except: In the Output Link Properties Tab:

Set the Lookup Type to Normal.

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage.

Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the data is processed by the conductor node.
Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.


Preserve partitioning. You can select Set or Clear. If you select Set, read operations will request that the next stage preserves the partitioning as is (it is ignored for write operations). Note that this field is only visible if the stage has output links.
Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

NLS Map
The NLS Map tab allows you to define a character set map for the Oracle Enterprise stage. You can set character set maps separately for NCHAR and NVARCHAR2 types and all other data types. This overrides the default character set map set for the project or the job. You can specify that the map be supplied as a job parameter if required.

Load performance may be improved by specifying an Oracle map instead of a DataStage map. To do this, add an entry to the file


oracle_cs, located at $APT_ORCHHOME/etc, to associate the DataStage map with an Oracle map. The oracle_cs file has the following format:
UTF-8         UTF8
ISO-8859-1    WE8ISO8859P1
EUC-JP        JA16EUC

The first column contains DataStage map names and the second column the Oracle map names they are associated with. So, using the example file shown above, specifying the DataStage map EUC-JP in the Oracle stage will cause the data to be loaded using the Oracle map JA16EUC.

Inputs Page
The Inputs page allows you to specify details about how the Oracle Enterprise Stage writes data to an Oracle database. The Oracle Enterprise Stage can have only one input link writing to one table. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Oracle Enterprise Stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and where. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Target/Table | string | N/A | Y (if Write Method = Load) | N | N/A
Target/Delete Rows Mode | Auto-generated delete/User-defined delete | Auto-generated delete | Y (if Write Method = Delete Rows) | N | N/A
Target/Delete SQL | string | N/A | Y (if Write Method = Delete Rows) | N | N/A
Target/Upsert mode | Auto-generated Update & Insert/Auto-generated Update Only/User-defined Update & Insert/User-defined Update Only | Auto-generated Update & Insert | Y (if Write Method = Upsert) | N | N/A
Target/Insert SQL | string | N/A | N | N | N/A
Target/Insert Array Size | number | 500 | N | N | Insert SQL
Target/Update SQL | string | N/A | Y (if Write Method = Upsert) | N | N/A
Target/Write Method | Delete Rows/Upsert/Load | Load | Y | N | N/A
Target/Write Mode | Append/Create/Replace/Truncate | Append | Y (if Write Method = Load) | N | N/A
Connection/DB Options | string | N/A | Y | N | N/A
Connection/DB Options Mode | Auto-generate/User-defined | Auto-generate | Y | N | N/A
Connection/User | string | N/A | Y (if DB Options Mode = Auto-generate) | N | DB Options Mode
Connection/Password | string | N/A | Y (if DB Options Mode = Auto-generate) | N | DB Options Mode
Connection/Remote Server | string | N/A | N | N | N/A
Options/Output Reject Records | True/False | False | Y (if Write Method = Upsert) | N | N/A
Options/Silently Drop Columns Not in Table | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Table Organization | Heap/Index | Heap | Y (if Write Method = Load and Write Mode = Create or Replace) | N | N/A
Options/Truncate Column Names | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Close Command | string | N/A | N | N | N/A
Options/Default String Length | number | 32 | N | N | N/A
Options/Index Mode | Maintenance/Rebuild | N/A | N | N | N/A
Options/Add NOLOGGING clause to Index rebuild | True/False | False | N | N | Index Mode
Options/Add COMPUTE STATISTICS clause to Index rebuild | True/False | False | N | N | Index Mode
Options/Open Command | string | N/A | N | N | N/A
Options/Oracle 8 Partition | string | N/A | N | N | N/A
Options/Create Primary Keys | True/False | False | Y (if Write Mode = Create or Replace) | N | N/A
Options/Disable Constraints | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Exceptions Table | string | N/A | N | N | Disable Constraints
Options/Table has NCHAR/NVARCHAR | True/False | False | N | N | N/A
Target Category
Table
Specify the name of the table to write to. You can specify a job parameter if required.
Delete Rows Mode
This only appears for the Delete Rows write method. Allows you to specify how the delete statement is to be derived. Choose from:
Auto-generated Delete. DataStage generates a delete statement for you, based on the values you have supplied for table name and column details. The statement can be viewed by selecting the Delete SQL property.
User-defined Delete. Select this to enter your own delete statement. Then select the Delete SQL property and edit the statement proforma.
Delete SQL
Only appears for the Delete Rows write method. This property allows you to view an auto-generated Delete statement, or to specify your own (depending on the setting of the Delete Rows Mode property).
Upsert mode
This only appears for the Upsert write method. Allows you to specify how the insert and update statements are to be derived. Choose from:


Auto-generated Update & Insert. DataStage generates update and insert statements for you, based on the values you have supplied for table name and on column details. The statements can be viewed by selecting the Insert SQL or Update SQL properties.
Auto-generated Update Only. DataStage generates an update statement for you, based on the values you have supplied for table name and on column details. The statement can be viewed by selecting the Update SQL properties.
User-defined Update & Insert. Select this to enter your own update and insert statements. Then select the Insert SQL and Update SQL properties and edit the statement proformas.
User-defined Update Only. Select this to enter your own update statement. Then select the Update SQL property and edit the statement proforma.
Insert SQL
Only appears for the Upsert write method. This property allows you to view an auto-generated Insert statement, or to specify your own (depending on the setting of the Upsert Mode property). It has a dependent property:
Insert Array Size
Specify the size of the insert host array. The default size is 500 records. If you want each insert statement to be executed individually, specify 1 for this property.
Update SQL
Only appears for the Upsert write method. This property allows you to view an auto-generated Update statement, or to specify your own (depending on the setting of the Upsert Mode property).
Write Method
Choose from Delete Rows, Upsert or Load (the default). Upsert allows you to provide the insert and update SQL statements and uses Oracle host-array processing to optimize the performance of inserting records. Load sets up a connection to Oracle and inserts records into a table, taking a single input data set. The Write Mode property determines how the records of a data set are inserted into the table.
Write Mode
This only appears for the Load Write Method. Select from the following:


Append. This is the default. New records are appended to an existing table.
Create. Create a new table. If the Oracle table already exists an error occurs and the job terminates. You must specify this mode if the Oracle table does not exist.
Replace. The existing table is first dropped and an entirely new table is created in its place. Oracle uses the default partitioning method for the new table.
Truncate. The existing table attributes (including schema) and the Oracle partitioning keys are retained, but any existing records are discarded. New records are then appended to the table.

Connection Category
DB Options
Specify a user name and password for connecting to Oracle in the form:
user=<user>,password=<password>[,arraysize=<num_records>]

DataStage does not encrypt the password when you use this option. Arraysize is only relevant to the Upsert Write Method.
DB Options Mode
If you select Auto-generate for this property, DataStage will create a DB Options string for you. If you select User-defined, you have to edit the DB Options property yourself. When Auto-generate is selected, there are two dependent properties:
User
The user name to use in the auto-generated DB options string.
Password
The password to use in the auto-generated DB options string. DataStage encrypts the password.
Remote Server
This is an optional property. Allows you to specify a remote server name.
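For example, a DB Options string in the form shown above might be (the user name and password here are purely illustrative):

user=scott,password=tiger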


Options Category
Create Primary Keys
This option is available with a Write Mode of Create or Replace. It is False by default. If you set it True, the columns marked as keys in the Columns tab will be marked as primary keys. You must set this true if you want to write index organized tables, and indicate which are the primary keys on the Columns tab. Note that, if you set it to True, the Index Mode option is not available.
Disable Constraints
This is False by default. Set True to disable all enabled constraints on a table when loading, then attempt to reenable them at the end of the load. This option is not available when you select a Table Organization type of Index to use index organized tables. When set True, it has a dependent property:
Exceptions Table
This property enables you to specify an exceptions table, which is used to record ROWID information on rows that violate constraints when the constraints are reenabled. The table must already exist.
Output Reject Records
This only appears for the Upsert write method. It is False by default; set to True to send rejected records to the reject link.
Silently Drop Columns Not in Table
This only appears for the Load Write Method. It is False by default. Set to True to silently drop all input columns that do not correspond to columns in an existing Oracle table. Otherwise the stage reports an error and terminates the job.
Table Organization
This appears only for the Load Write Method using the Create or Replace Write Mode. Allows you to specify Index (for index organized tables) or heap organized tables (the default). When you select Index, you must also set Create Primary Keys to true. In index organized tables (IOTs) the rows of the table are held in the index created from the primary keys.


Truncate Column Names
This only appears for the Load Write Method. Set this property to True to truncate column names to 30 characters.
Close Command
This is an optional property and only appears for the Load Write Method. Use it to specify any command, in single quotes, to be parsed and executed by the Oracle database on all processing nodes after the stage finishes processing the Oracle table. You can specify a job parameter if required.
Default String Length
This is an optional property and only appears for the Load Write Method. It is set to 32 by default. Sets the default string length of variable-length strings written to an Oracle table. Variable-length strings longer than the set length cause an error. The maximum length you can set is 2000 bytes. Note that the stage always allocates the specified number of bytes for a variable-length string. In this case, setting a value of 2000 allocates 2000 bytes for every string. Therefore, you should set the expected maximum length of your largest string and no larger.
Index Mode
This is an optional property and only appears for the Load Write Method. Lets you perform a direct parallel load on an indexed table without first dropping the index. You can choose either Maintenance or Rebuild mode. The Index property only applies to append and truncate Write Modes.
Rebuild skips index updates during table load and instead rebuilds the indexes after the load is complete using the Oracle alter index rebuild command. The table must contain an index, and the indexes on the table must not be partitioned. The Rebuild option has two dependent properties:
Add NOLOGGING clause to Index rebuild
This is False by default. Set True to add a NOLOGGING clause.
Add COMPUTE STATISTICS clause to Index rebuild
This is False by default. Set True to add a COMPUTE STATISTICS clause.
Maintenance results in each table partition being loaded sequentially. Because of the sequential load, the table index that exists before the table is loaded is maintained after the table is loaded.

The table must contain an index and be partitioned, and the index on the table must be a local range-partitioned index that is partitioned according to the same range values that were used to partition the table. Note that in this case sequential means sequential per partition, that is, the degree of parallelism is equal to the number of partitions.
Open Command
This is an optional property and only appears for the Load Write Method. Use it to specify a command, in single quotes, to be parsed and executed by the Oracle database on all processing nodes before the Oracle table is opened. You can specify a job parameter if required.
Oracle 8 Partition
This is an optional property and only appears for the Load Write Method. Name of the Oracle 8 table partition that records will be written to. The stage assumes that the data provided is for the partition specified.
Table has NCHAR/NVARCHAR
This option applies to Create or Replace Write Modes. Set it True if the table being written contains NCHAR and NVARCHARS, so that the correct columns are created in the target table.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the Oracle database. It also allows you to specify that the data should be sorted before being written. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Oracle Enterprise Stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
Whether the Oracle Enterprise Stage is set to execute in parallel or sequential mode.


Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Oracle Enterprise Stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Oracle Enterprise Stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Oracle Enterprise Stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.
Same. Preserves the partitioning already in place. This is the default for Oracle Enterprise Stages.
DB2. Replicates the partitioning method of the specified DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
(Auto). This is the default collection method for Oracle Enterprise Stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the file or files. The sort is always carried out within data partitions. If the stage is partitioning incoming data, the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows:
Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled, an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about how the Oracle Enterprise Stage reads data from an Oracle database. The Oracle Enterprise Stage can have only one output link. Alternatively it can have a reference output link, which is used by the Lookup stage when referring to an Oracle lookup table. It can also have a reject link where rejected records are routed (used in conjunction with an input link). The Output Name drop-down list allows you to choose whether you are looking at details of the main output link or the reject link.


The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. Details about Oracle Enterprise Stage properties are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Output Link Properties Tab


The Properties tab allows you to specify properties for the output link. These dictate how data is read, and from what table. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The Build SQL button allows you to instantly open the SQL Builder to help you construct an SQL query to read data. See Chapter 59, "SQL Builder" for guidance on using it. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Source/Lookup Type | Normal/Sparse | Normal | Y (if output is reference link connected to Lookup stage) | N | N/A
Source/Read Method | Auto-generated SQL/Table/SQL builder generated SQL/User-defined SQL | SQL builder generated SQL | Y | N | N/A
Source/Table | string | N/A | N | N | N/A
Source/Where | string | N/A | N | N | Table
Source/Select List | string | N/A | N | N | Table
Source/Query | string | N/A | N | N | N/A
Source/Partition Table | string | N/A | N | N | N/A
Connection/DB Options | string | N/A | Y | N | N/A
Connection/DB Options Mode | Auto-generate/User-defined | Auto-generate | Y | N | N/A
Connection/User | string | N/A | Y (if DB Options Mode = Auto-generate) | N | DB Options Mode
Connection/Password | string | N/A | Y (if DB Options Mode = Auto-generate) | N | DB Options Mode
Connection/Remote Server | string | N/A | N | N | N/A
Options/Close Command | string | N/A | N | N | N/A
Options/Open Command | string | N/A | N | N | N/A
Options/Make Combinable | True/False | False | Y (if link is reference and Lookup type = sparse) | N | N/A
Options/Table has NCHAR/NVARCHAR | True/False | False | N | N | N/A

Source Category
Lookup Type
Where the Oracle Enterprise Stage is connected to a Lookup stage via a reference link, this property specifies whether the Oracle Enterprise Stage will provide data for an in-memory look up (Lookup Type = Normal) or whether the lookup will access the database directly (Lookup Type = Sparse). If the Lookup Type is Normal, the Lookup stage can have multiple reference links. If the Lookup Type is Sparse, the Lookup stage can only have one reference link.
Read Method
This property specifies whether you are specifying a table or a query when reading the Oracle database, and how you are generating the query.


Select the Table method in order to use the Table property to specify the read. This will read in parallel.
Select Auto-generated SQL to have DataStage automatically generate an SQL query based on the columns you have defined and the table you specify in the Table property.
Select User-defined SQL to define your own query.
Select SQL Builder Generated SQL to open the SQL Builder and define the query using its helpful interface (see Chapter 59, "SQL Builder.")
By default, Read methods of SQL Builder Generated SQL, Auto-generated SQL, and User-defined SQL operate sequentially on a single node. You can have the User-defined SQL read operate in parallel if you specify the Partition Table property.
Query
Optionally allows you to specify an SQL query to read a table. The query specifies the table and the processing that you want to perform on the table as it is read by the stage. This statement can contain joins, views, database links, synonyms, and so on.
Table
Specifies the name of the Oracle table. The table must exist and you must have SELECT privileges on the table. If your Oracle user name does not correspond to the owner of the specified table, you can prefix it with a table owner in the form:
table_owner.table_name

Table has dependent properties:
Where
Stream links only. Specifies a WHERE clause of the SELECT statement to specify the rows of the table to include or exclude from the read operation. If you do not supply a WHERE clause, all rows are read.
Select List
Optionally specifies an SQL select list, enclosed in single quotes, that can be used to determine which columns are read. You must specify the columns in the list in the same order as the columns are defined in the record schema of the input table.

Partition Table
Specifies execution of the SELECT in parallel on the processing nodes containing a partition derived from the named table. If you do not specify this, the stage executes the query sequentially on a single node.
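As an illustration of how these properties combine (the table and column names below are hypothetical), a Table of payroll.employees with a Select List of empno, sal and a Where of sal > 40000 reads the table with a statement equivalent to:

SELECT empno, sal FROM payroll.employees WHERE sal > 40000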

Connection Category
DB Options
Specify a user name and password for connecting to Oracle in the form:
user=<user>,password=<password>[,arraysize=<num_records>]

DataStage does not encrypt the password when you use this option. Arraysize only applies to stream links. The default arraysize is 1000.
DB Options Mode
If you select Auto-generate for this property, DataStage will create a DB Options string for you. If you select User-defined, you have to edit the DB Options property yourself. When Auto-generate is selected, there are two dependent properties:
User
The user name to use in the auto-generated DB options string.
Password
The password to use in the auto-generated DB options string. DataStage encrypts the password.
Remote Server
This is an optional property. Allows you to specify a remote server name.

Options Category
Close Command
This is an optional property and only appears for stream links. Use it to specify any command to be parsed and executed by the Oracle database on all processing nodes after the stage finishes processing the Oracle table. You can specify a job parameter if required.


Open Command
This is an optional property and only appears for stream links. Use it to specify any command to be parsed and executed by the Oracle database on all processing nodes before the Oracle table is opened. You can specify a job parameter if required.
Make Combinable
Only applies to reference links where the Lookup Type property has been set to Sparse. Set to True to specify that the lookup can be combined with its preceding and/or following process.
Table has NCHAR/NVARCHAR
Set this True if the table being read from contains NCHAR and NVARCHARS.


14
Teradata Enterprise Stage
The Teradata Enterprise stage is a database stage. It allows you to read data from and write data to a Teradata database. The Teradata Enterprise stage can have a single input link or a single output link.

When you edit a Teradata Enterprise stage, the Teradata Enterprise stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors." The stage editor has up to three pages, depending on whether you are reading or writing a file:
Stage Page. This is always present and is used to specify general information about the stage.


Inputs Page. This is present when you are writing to a Teradata database. This is where you specify details about the data being written.
Outputs Page. This is present when you are reading from a Teradata database. This is where you specify details about the data being read.

Accessing Teradata Databases


Installing the Teradata Utilities Foundation
You must install Teradata Utilities Foundation on all nodes that will run DataStage parallel jobs. Refer to the installation instructions supplied by Teradata. (You need system administrator status for the install.)

Creating Teradata User


You must set up a Teradata database user (this is the user that will be referred to by the DB options property in the Teradata stage). The user must be able to create tables and insert and delete data. The database for which you create this account requires at least 100 MB of PERM space and 10 MB of SPOOL. Larger allocations may be required if you run large and complex jobs. (You need database administrator status in order to create user and database.) The example below shows you how to create the orchserver account. The user information is stored in the terasync table. The name of the database in this example is userspace. The following four commands for BTEQ set up the account:
CREATE USER orchserver FROM userspace AS
   PASSWORD = orchserver
   PERM = 100000000
   SPOOL = 10000000

Once the account is set up, issue the following command:


GRANT select ON dbc TO orchserver;

Creating a Database Server


If you want to use a pre-existing Teradata user, you need only install a database server and configure it to use a new database. Install the new database server with the same PERM and SPOOL values as shown above. Here is an example of creating a database server called devserver using table userspace:

CREATE DATABASE devserver FROM userspace AS PERM = 100000000 SPOOL = 10000000
GRANT create table, insert, delete, select ON devserver TO orchclient;
GRANT create table, insert, delete, select ON devserver TO orchserver;

Teradata Databases Points to Note


NLS Support and Teradata Database Character Sets
The Teradata database supports a fixed number of character set types for each char or varchar column in a table. Use this query to get the character set for a Teradata column:
select column_name, chartype from dbc.columns where tablename = 'table_name'

The database character set types are:
Latin: chartype=1. The character set for U.S. and European applications which limit character data to the ASCII or ISO 8859 Latin1 character sets. This is the default.
Unicode: chartype=2. 16-bit Unicode characters from the ISO 10646 Level 1 character set. This setting supports all of the ICU multi-byte character sets.
KANJISJIS: chartype=3. For Japanese third-party tools that rely on the string length or physical space allocation of KANJISJIS.
Graphic: chartype=4. Provided for DB2 compatibility.
Note The KANJI1: chartype=5 character set is available for Japanese applications that must remain compatible with previous releases; however, this character set will be removed in a subsequent release because it does not support the new string functions and will not support future character sets. We recommend that you use the set of SQL translation functions provided to convert KANJI1 data to Unicode.

DataStage maps characters between Teradata columns and the internal UTF-16 Unicode format using the project default character set map unless this has been overridden at the job level (on the Job Properties dialog box) or the stage level (using the NLS Map tab, see page 14-9). The file tera_cs.txt in the directory $APT_ORCHHOME/etc maps DataStage NLS character sets to Teradata character sets. For example, we select EUC_JP as the NLS map for the current project. EUC_JP is the NLS character set for Japanese, and 118 is the Teradata character set

code for the KANJIEUC_0U character set. EUC_JP is mapped to 118 in tera_cs.txt as follows:
EUC_JP        118
ASC_JPN_EUC   118
SJIS          119

On reading, DataStage converts a Teradata varchar(n) field to ustring [n/min] where min is the minimum size in bytes of the largest codepoint for your specified character set. On writing, ustring data is converted to the specified character set and written to a char or varchar column in the Teradata database; the type is ustring[n*max] where max is the maximum size in bytes of the largest codepoint for your specified character set. DataStage also supports the use of Unicode character data in usernames, passwords, column names, table names, and database names.

Column Name and Data Type Conversion


DataStage column names are case sensitive; Teradata column names are not. You must ensure that the DataStage column names are unique regardless of case. Both DataStage and Teradata columns support nulls, and a DataStage column that contains a null is stored as a null in the corresponding Teradata column. The Teradata stage automatically converts DataStage data types to Teradata data types and vice versa as shown in the following table:
DataStage SQL Data Type      Underlying Data Type    Teradata Data Type
Date                         date                    date
Decimal, Numeric             decimal (p, s)          numeric (p, s)
Double                       dfloat                  double precision
Double                       dfloat                  float
Double                       dfloat                  real
TinyInt                      int8                    byteint
SmallInt                     int16                   smallint
Integer                      int32                   integer
BigInt                       int64                   unsupported
LongVarBinary, VarBinary     raw                     varbyte (default)
Binary, Bit                  raw [fixed_size]        byte (fixed_size)
LongVarBinary, VarBinary     raw [max=size]          varbyte (size)
LongVarBinary, VarBinary     raw [max=size]          graphic (c)
LongVarBinary, VarBinary     raw [max=size]          vargraphic (size)
LongVarBinary, VarBinary     raw [max=size]          long vargraphic
Float, Real                  sfloat                  unsupported
LongVarChar, VarChar         string                  varchar (default length)
Unknown, Char                string [fixed_size]     char (fixed_size)
LongVarChar, VarChar         string [max=size]       varchar (size)
LongVarChar                  string [max=size]       long varchar (size)
Time                         time                    unsupported
Timestamp                    timestamp               unsupported
TinyInt                      uint8                   unsupported
SmallInt                     uint16                  unsupported
Integer                      uint32                  unsupported

DataStage columns are matched by name and data type to columns of the Teradata table, but they do not have to appear in the same order. The following rules determine which DataStage columns are written to a Teradata table: If there are DataStage columns for which there are no matching columns in the Teradata table, the job terminates. However, you can deal with this by setting the Silently drop columns not in table property (see page 14-14) or by dropping the column before you write the data.


If the Teradata table contains a column that does not have a corresponding DataStage column, Teradata writes the column's default value into the field. If no default value is defined for the Teradata column, Teradata writes a null. If the field is not nullable, an error is generated and the job fails.

Restrictions and Limitations when Writing to a Teradata Database


There are the following limitations when using a Teradata Enterprise stage to write to a Teradata database: A Teradata row may contain a maximum of 256 columns. While the names of DataStage columns can be of any length, the names of Teradata columns cannot exceed 30 characters. Rename your columns if necessary or specify the Truncate column names property to deal automatically with overlength column names (see page 14-14). DataStage assumes that the stage writes to buffers whose maximum size is 32 KB. However, you can override this and enable the use of 64 KB buffers by setting the environment variable APT_TERA_64K_BUFFERS (see "APT_TERA_64K_BUFFERS" in Parallel Job Advanced Developers Guide). When writing to Teradata, the DataStage column definitions should not contain fields of the following types:

BigInt (int64)
Unsigned integer of any size
String, fixed- or variable-length, longer than 32 KB
Raw, fixed- or variable-length, longer than 32 KB
Subrecord
Tagged aggregate
Vectors

If DataStage tries to write data whose columns contain a data type listed above, the write is not begun and the job containing the stage fails. You can convert unsupported data types by using the Modify stage (see Chapter 28, "Modify Stage"). The Teradata Enterprise stage uses a distributed FastLoad to write the data and is subject to all the restrictions on FastLoad. Briefly, these are:

There is a limit to the number of concurrent FastLoad and FastExport jobs in Teradata.

Each instance of the Teradata stage using FastLoad or FastExport in a job counts towards this limit.

Restrictions on Reading a Teradata Database


The Teradata Enterprise stage uses a distributed FastExport to access the data and is subject to all the restrictions on FastExport. Briefly, these are: There is a limit to the number of concurrent FastLoad and FastExport jobs. Each instance of the Teradata stage using FastLoad or FastExport in a job counts towards this limit. Aggregates and most arithmetic operators in the SELECT statement are not allowed. The use of the USING modifier is not allowed. Non-data access (that is, pseudo-tables like DATE or USER) is not allowed. Single-AMP requests are not allowed. These are SELECTs satisfied by an equality term on the primary index or on a unique secondary index.
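For example (the table and column names are hypothetical), if acct_id is the primary index of the accounts table, a statement of the following form is a single-AMP request and so cannot be used by the stage:
SELECT * FROM accounts WHERE acct_id = 1234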

Must Dos
DataStage has many defaults which means that it can be very easy to include Teradata Enterprise Stages in a job. This section specifies the minimum steps to take to get a Teradata Enterprise Stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are when you get familiar with the product. The steps required depend on what you are using a Teradata Enterprise Stage for.

Writing a Teradata Database


In the Input Link Properties Tab, under the Target category:

Specify the Table you are writing. Specify the write mode (by default DataStage appends to existing tables; you can also choose to create a new table, replace an existing table, or keep existing table details but replace all the rows).

Under the Connection category:

You can either manually specify a connection string, or have DataStage generate one for you using a user name and password you supply. Either way you need to supply a valid username and password. DataStage encrypts the password when you use the auto-generate option. Specify the name of the server hosting Teradata.

Ensure column meta data has been specified for the write.

Reading a Teradata Database


In the Output Link Properties Tab, under the Source category:

Choose a Read Method. This is Table by default, which reads directly from a table, but you can also choose to read using auto-generated SQL or user-generated SQL. Specify the table to be read. If using a Read Method of user-generated SQL, specify the SELECT SQL statement to use. DataStage provides the autogenerated statement as a basis, which you can edit as required.

Under the Connection category:

You can either manually specify a connection string, or have DataStage generate one for you using a user name and password you supply. Either way you need to supply a valid username and password. DataStage encrypts the password when you use the auto-generate option. Specify the name of the server hosting Teradata.

Ensure column meta data has been specified for the read.

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage.

Advanced Tab
This tab allows you to specify the following:


Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the data is processed by the conductor node.
Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. You can select Set or Clear. If you select Set, read operations will request that the next stage preserves the partitioning as is (the Preserve partitioning field is not visible unless the stage has an output link).
Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

NLS Map
The NLS Map tab allows you to define a character set map for the Teradata Enterprise stage. This overrides the default character set map


set for the project or the job. You can specify that the map be supplied as a job parameter if required.

Inputs Page
The Inputs page allows you to specify details about how the Teradata Enterprise Stage writes data to a Teradata database. The Teradata Enterprise Stage can have only one input link writing to one table. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Teradata Enterprise Stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and where. Some of the properties are mandatory, although many have default settings.


Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Target/Table: Values = Table_Name; Default = N/A; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Target/Primary Index: Values = Columns List; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = Table
Target/Select List: Values = List; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = Table
Target/Write Mode: Values = Append/Create/Replace/Truncate; Default = Append; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Connection/DB Options: Values = String; Default = N/A; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Connection/Database: Values = Database Name; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Connection/Server: Values = Server Name; Default = N/A; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Options/Close Command: Values = Close Command; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Open Command: Values = Open Command; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Silently Drop Columns Not in Table: Values = True/False; Default = False; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Options/Default String Length: Values = String Length; Default = 32; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Truncate Column Names: Values = True/False; Default = False; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Options/Progress Interval: Values = Number; Default = 100000; Mandatory? = N; Repeats? = N; Dependent of = N/A

Target Category
Table Specify the name of the table to write to. The table name must be a valid Teradata table name. Table has two dependent properties: Select List


Specifies a list that determines which columns are written. If you do not supply the list, the Teradata Enterprise Stage writes to all columns. Do not include formatting characters in the list. Primary Index Specify a comma-separated list of column names that will become the primary index for tables. Format the list according to Teradata standards and enclose it in single quotes. For performance reasons, the data set should not be sorted on the primary index. The primary index should not be a smallint, or a column with a small number of values, or a high proportion of null values. If no primary index is specified, the first column is used. All the considerations noted above apply to this case as well. Write Mode Select from the following: Append. Appends new records to the table. The database user must have TABLE CREATE privileges and INSERT privileges on the table being written to. This is the default. Create. Creates a new table. The database user must have TABLE CREATE privileges. If a table exists of the same name as the one you want to create, the data flow that contains the Teradata stage terminates in error. Replace. Drops the existing table and creates a new one in its place; the database user must have TABLE CREATE and TABLE DELETE privileges. If a table exists of the same name as the one you want to create, it is overwritten. Note that you cannot create or replace a table that has primary keys; you should not specify primary keys in your meta data. Truncate. Retains the table attributes, including the table definition, but discards existing records and appends new ones. The database user must have DELETE and INSERT privileges on the table.
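As an illustration of the Primary Index format described above (the column names are hypothetical), a two-column primary index would be entered as:
'customer_id, order_date'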

Connection Category
DB Options Specify a user name and password for connecting to Teradata in the form:
user = <user>, password = <password> [SessionsPerPlayer = <num_sessions>] [RequestedSessions = <num_requested>]
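For example, a complete DB Options string in this format might look like the following (the user name, password, and session settings are purely illustrative):
user = orchserver, password = orchpw, SessionsPerPlayer = 2, RequestedSessions = 16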


The value of sessionsperplayer determines the number of connections each player has to Teradata. Indirectly, it also determines the number of players. The number selected should be such that (sessionsperplayer * number of nodes * number of players per node) equals the total requested sessions. The default is 2. Setting the value of sessionsperplayer too low on a large system can result in so many players that the step fails due to insufficient resources. In that case, sessionsperplayer should be increased. The value of the optional requestedsessions is a number between 1 and the number of vprocs in the database. The default is the maximum number of available sessions. DataStage does not encrypt the password when you use this option.
DB Options Mode
If you select Auto-generate for this property, DataStage will create a DB Options string for you. If you select User-defined, you have to edit the DB Options property yourself. When Auto-generate is selected, there are two dependent properties:
User
The user name to use in the auto-generated DB options string.
Password
The password to use in the auto-generated DB options string. DataStage encrypts the password.
Database
By default, the write operation is carried out in the default database of the Teradata user whose profile is used. If no default database is specified in that user's Teradata profile, the user name is the default database. If you supply the database name, the database to which it refers must exist and you must have the necessary privileges.
Server
Specify the name of a Teradata server.
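As a worked illustration of the sessions calculation described under DB Options above (all figures are hypothetical), on a system with 4 nodes and 2 players per node, a request for 16 sessions implies:
SessionsPerPlayer = 16 / (4 nodes * 2 players per node) = 2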

Options Category
Close Command Specify a Teradata command to be parsed and executed by Teradata on all processing nodes after the table has been populated.
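For illustration only (the table name is hypothetical), a Close Command might remove a work table once the target table has been populated:
DROP TABLE load_work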


Open Command Specify a Teradata command to be parsed and executed by Teradata on all processing nodes before the table is populated. Silently Drop Columns Not in Table Specifying True causes the stage to silently drop all unmatched input columns; otherwise the job fails. Default String Length Specify the maximum length of variable-length raw or string columns. The default length is 32 bytes. The upper bound is slightly less than 32 KB. Truncate Column Names Specify whether the column names should be truncated to 30 characters or not. Progress Interval By default, the stage displays a progress message for every 100,000 records per partition it processes. Specify this option either to change the interval or to disable the message. To change the interval, specify a new number of records per partition. To disable the messages, specify 0.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the Teradata database. It also allows you to specify that the data should be sorted before being written. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Teradata Enterprise Stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Teradata Enterprise Stage is set to execute in parallel or sequential mode.


Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Teradata Enterprise Stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Teradata Enterprise Stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Teradata Enterprise Stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. This is the default for Teradata Enterprise Stages. Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Teradata Enterprise Stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on.


Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the database. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about how the Teradata Enterprise Stage reads data from a Teradata database. The Teradata Enterprise Stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link.


Details about Teradata Enterprise Stage properties are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Output Link Properties Tab


The Properties tab allows you to specify properties for the output link. These dictate how incoming data is read and from what table. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Source/Read Method: Values = Table/Auto-generated SQL/User-defined SQL; Default = Table; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Source/Table: Values = Table Name; Default = N/A; Mandatory? = Y (if Read Method = Table or Auto-generated SQL); Repeats? = N; Dependent of = N/A
Source/Select List: Values = List; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = Table
Source/Where Clause: Values = Filter; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = Table
Source/Query: Values = SQL query; Default = N/A; Mandatory? = Y (if Read Method = User-defined SQL or Auto-generated SQL); Repeats? = N; Dependent of = N/A
Connection/DB Options: Values = String; Default = N/A; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Connection/Database: Values = Database Name; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Connection/Server: Values = Server Name; Default = N/A; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Options/Close Command: Values = String; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Open Command: Values = String; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Progress Interval: Values = Number; Default = 100000; Mandatory? = N; Repeats? = N; Dependent of = N/A


Source Category
Read Method Select Table to use the Table property to specify the read (this is the default). Select Auto-generated SQL to have DataStage automatically generate an SQL query based on the columns you have defined and the table you specify in the Table property. You must select the Query property and select Generate from the right-arrow menu to actually generate the statement. Select User-defined SQL to define your own query. Table Specifies the name of the Teradata table to read from. The table must exist, and the user must have the necessary privileges to read it.
The Teradata Enterprise Stage reads the entire table, unless you limit its scope by means of the Select List and/or Where suboptions: Select List Specifies a list of columns to read. The items of the list must appear in the same order as the columns of the table. Where Clause Specifies selection criteria to be used as part of an SQL statement's WHERE clause. Do not include formatting characters in the query.

These dependent properties are only available when you have specified a Read Method of Table rather than Auto-generated SQL. Query This property is used to contain the SQL query when you choose a Read Method of User-defined SQL or Auto-generated SQL. If you are using Auto-generated SQL you must select a table and specify some column definitions, then select Generate from the right-arrow menu to have DataStage generate the query.
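To illustrate how the Source properties combine (the table and column names are hypothetical), with Table set to customers, a Select List of custid, balance, and a Where Clause of balance > 0, the stage reads the table as if you had supplied a query such as:
SELECT custid, balance FROM customers WHERE balance > 0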

Connection Category
DB Options Specify a user name and password for connecting to Teradata in the form:
<user = <user>, password= <password> [SessionsPerPlayer = <num_sessions>][RequestedSessions = <num_requested>]


The value of sessionsperplayer determines the number of connections each player has to Teradata. Indirectly, it also determines the number of players. The number selected should be such that (sessionsperplayer * number of nodes * number of players per node) equals the total requested sessions. The default is 2. Setting the value of sessionsperplayer too low on a large system can result in so many players that the step fails due to insufficient resources. In that case, sessionsperplayer should be increased. The value of the optional requestedsessions is a number between 1 and the number of vprocs in the database. The default is the maximum number of available sessions. DataStage does not encrypt the password when you use this option.
DB Options Mode
If you select Auto-generate for this property, DataStage will create a DB Options string for you. If you select User-defined, you have to edit the DB Options property yourself. When Auto-generate is selected, there are two dependent properties:
User
The user name to use in the auto-generated DB options string.
Password
The password to use in the auto-generated DB options string. DataStage encrypts the password.
Database
By default, the read operation is carried out in the default database of the Teradata user whose profile is used. If no default database is specified in that user's Teradata profile, the user name is the default database. This option overrides the default. If you supply the database name, the database to which it refers must exist and you must have the necessary privileges.
Server
Specify the name of a Teradata server.

Options Category
Close Command Optionally specifies a Teradata command to be run once by Teradata on the conductor node after the query has completed.

Open Command Optionally specifies a Teradata command run once by Teradata on the conductor node before the query is initiated. Progress Interval By default, the stage displays a progress message for every 100,000 records per partition it processes. Specify this option either to change the interval or to disable the message. To change the interval, specify a new number of records per partition. To disable the messages, specify 0.


15
Informix Enterprise Stage
The Informix Enterprise Stage is a database stage. It allows you to read data from and write data to an Informix 7.x, 8.x or 9.x database. The Informix Enterprise Stage can have a single input link or a single output link.

When you edit an Informix Enterprise Stage, the Informix Enterprise Stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors." The stage editor has up to three pages, depending on whether you are reading or writing a database: Stage Page. This is always present and is used to specify general information about the stage.


Inputs Page. This is present when you are writing to an Informix database. This is where you specify details about the data being written. Outputs Page. This is present when you are reading from an Informix database. This is where you specify details about the data being read.

Accessing Informix Databases


You must have the correct privileges and settings in order to use the Informix Enterprise Stage. You must have a valid account and appropriate privileges on the databases to which you connect. You require read and write privileges on any table to which you connect, and Resource privileges for using the Partition Table property on an output link or using create and replace modes on an input link. To configure access to Informix:
1  Make sure that Informix is running.
2  Make sure the INFORMIXSERVER is set in your environment. This corresponds to a server name in sqlhosts and is set to the coserver name of coserver 1. The coserver must be accessible from the node on which you invoke your DataStage job.
3  Make sure that INFORMIXDIR points to the installation directory of your INFORMIX server.
4  Make sure that INFORMIXSQLHOSTS points to the sql hosts path (e.g., /disk6/informix/informix_runtime/etc/sqlhosts). A combined example of setting these variables is shown after this list.
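A minimal sketch of setting these variables in a Bourne shell session; the installation directory /usr/informix and the server name ifmx_server1 are hypothetical values, so substitute those for your own installation:
INFORMIXSERVER=ifmx_server1; export INFORMIXSERVER
INFORMIXDIR=/usr/informix; export INFORMIXDIR
INFORMIXSQLHOSTS=$INFORMIXDIR/etc/sqlhosts; export INFORMIXSQLHOSTS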

Considerations for Using the High Performance Loader (HPL)


You can read and write data to an Informix 7.x or 9.x database using the Informix High Performance Loader by specifying a connection method of HPL in the input or output properties (see page 15-12 and page 15-18). Note the following when reading or writing using the High Performance Loader: The INFORMIX onpload database must exist and be set up. You do this by running the INFORMIX ipload utility once and exiting it. An appropriate warning appears if the database is not set up properly.


The High Performance Loader uses more shared memory, and therefore more semaphores, than INFORMIX does in general. If the HPL is unable to allocate enough shared memory or semaphores, the DataStage read or write may not work. For more information about shared memory limits, contact your system administrator.

Reading Data on a Remote Machine using HPL


You can read data on a remote machine using the High Performance Loader without having INFORMIX installed on your local machine. This uses the HPL connect method in the Output properties (see "Connection Method" on page 15-18). The machines must be cross-mounted in order to make a remote connection. These instructions assume that DataStage has already been installed on your local machine and that the Parallel engine is available on the remote machine. (See the section "Copying the Parallel Engine to Your System Nodes" in the DataStage Install and Upgrade Guide.) To establish a remote connection to an Informix Enterprise Stage:
1  Verify that the INFORMIX sqlhosts file on the remote machine has a TCP interface. A TCP interface is necessary to use the remote connection functionality.
2  Copy the INFORMIX etc/sqlhosts file from the remote machine to a directory on your local machine. Set the INFORMIX INFORMIXDIR environment variable to this directory. For example, if the directory on the local machine is /apt/informix, the sqlhosts file should be in the directory /apt/informix/etc, and the INFORMIXDIR variable should be set to /apt/informix.
3  Set the INFORMIXSERVER environment variable to the name of the remote INFORMIX server.
4  Add the remote INFORMIX server nodes to your PX node configuration file located in $APT_ORCHHOME/../../config, and use a nodepool resource constraint to limit the execution of the Informix Enterprise Stage to these nodes. In the example configuration file below, the local machine is fastname local_machine, and the INFORMIX remote server machine is fastname remote_machine. The nodepool for the remote nodes is arbitrarily named "InformixServer". The configuration file must contain at least two nodes, one for the local machine and one for the remote machine. Here is the DataStage example configuration file before any changes have been made:


{ node "node0" { fastname "local_machine" pools "" "node0" "local_machine"resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node1" { fastname "local_machine" pools "" "node1" "local_machine" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } }

Here is the DataStage example configuration file with changes made for the Informix Enterprise Stage:
{ node "node0" { fastname "local_machine" pools "" "local_machine" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node1" { fastname "local_machine" pools "" "local_machine" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node2" { fastname "remote_machine" pools "InformixServer" "remote_machine" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node3" { fastname "remote_machine" pools "InformixServer" "remote_machine" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } }


5  Go to the Stage page Advanced tab of the Informix Enterprise Stage (see page 15-9). Select Node pool and resource constraints and Nodepool along with the name of the node pool constraint (i.e., "InformixServer" in the example configuration file above).
6  Set up environment variables. Remote access to an INFORMIX database requires the use of two INFORMIXDIR environment variable settings, one for the local DataStage machine which is set as in step 2 above, and one for the machine with the remote INFORMIX database. The remote variable needs to be set in a startup script which you must create on the local machine. This startup script is executed automatically by the Parallel Engine. Here is a sample startup.apt file with INFORMIXDIR being set to /usr/informix/9.4, the INFORMIX directory on the remote machine:
#! /bin/sh
INFORMIXDIR=/usr/informix/9.4
export INFORMIXDIR
INFORMIXSQLHOSTS=$INFORMIXDIR/etc/sqlhosts
export INFORMIXSQLHOSTS
shift 2
exec $*

7  Set the environment variable APT_STARTUP_SCRIPT to the full pathname of the startup.apt file.

You are now ready to run a DataStage job which uses the Informix Enterprise Stage HPL read method to connect to a remote INFORMIX server. If you are unable to connect to the remote server, try making either one or both of the following changes to your sqlhosts file on the local machine: In the fourth column in the row corresponding to the remote INFORMIX server name, replace the INFORMIX server name with the INFORMIX server port number found in the /etc/services file on the remote machine. The third column contains the hostname of the remote machine. Change this to the IP address of the remote machine.
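As a sketch of those two edits (the server name remotedb, protocol onsoctcp, IP address, and port number shown here are all hypothetical), a sqlhosts entry of the form:
remotedb  onsoctcp  remote_machine  remotedb
would become:
remotedb  onsoctcp  192.168.1.20  1526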

Using Informix XPS Stages on AIX Systems


In order to run jobs containing Informix XPS stages on AIX systems, you need to have the Informix client sdk 2.81 version installed along with the Informix XPS server. The LIBPATH order should be set as follows:
LIBPATH=$APT_ORCHHOME/lib:$INFORMIXDIR/lib:`dirname $DSHOME`/branded_odbc/lib:$DSHOME/lib:$DSHOME/uvdlls:$DSHOME/java/jre/bin/classic:$DSHOME/java/jre/bin:$INFORMIXDIR/lib:$INFORMIXDIR/lib/cli:$INFORMIXDIR/lib/esql


Type Conversions - Writing to Informix


When writing or loading, the Informix Enterprise stage automatically converts DataStage data types to Informix data types as shown in the following table:
DataStage SQL Data Type     Underlying Data Type                                          Informix Data Type
Unknown, Char               string[n]                                                     CHAR(n)
LongVarChar, VarChar        string[max = n] (variable-length string, maximum length n)    VARCHAR(n)
Date                        date                                                          DATE
Date, Time, or Timestamp    date, time or timestamp                                       DATETIME
Decimal, Numeric            decimal[p, s]                                                 DECIMAL(p, s)
Double                      dfloat                                                        DOUBLE_PRECISION
Double                      dfloat                                                        FLOAT
Float, Real                 sfloat                                                        FLOAT
Integer                     int32                                                         INTEGER
SmallInt                    int16                                                         SMALLINT
TinyInt                     int8                                                          SMALLINT

The default length of VARCHAR is 32 bytes. That is, 32 bytes are allocated for each variable-length string field in the input data set. If an input variable-length string field is longer than 32 bytes, the stage issues a warning.

Type Conversions - Reading from Informix


When reading, the Informix Enterprise stage automatically converts Informix data types to DataStage data types as shown in the following table:
DataStage SQL Data Type     Underlying Data Type                                          Informix Data Type
Unknown, Char               string[n]                                                     CHAR(n)
Unknown, Char               string[n]                                                     NCHAR(n, r)
LongVarChar, VarChar        string[max = n] (variable-length string, maximum length n)    NVARCHAR(n, r)
LongVarChar, VarChar        string[max = n] (variable-length string, maximum length n)    CHARACTER VARYING(n, r)
Date                        date                                                          DATE
Date, Time, or Timestamp    date, time or timestamp                                       DATETIME
Decimal, Numeric            decimal[p, s]                                                 DECIMAL(p, s)
Double                      dfloat                                                        DOUBLE_PRECISION
Double                      dfloat                                                        FLOAT
Float, Real                 sfloat                                                        SMALLFLOAT
Decimal, Numeric            decimal                                                       MONEY
Float, Real                 sfloat                                                        REAL
Integer                     int32                                                         INTEGER
Integer                     int32                                                         SERIAL
SmallInt                    int16                                                         SMALLINT

Must Dos
DataStage has many defaults which means that it can be very easy to include Informix Enterprise Stages in a job. This section specifies the minimum steps to take to get an Informix Enterprise Stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are when you get familiar with the product. The steps required depend on what you are using an Informix Enterprise Stage for.


Writing an Informix Database


In the Input Link Properties Tab, under the Target category:

Specify the Table you are writing. Specify the write mode (by default DataStage appends to existing tables; you can also choose to create a new table, replace an existing table, or keep existing table details but replace all the rows).

Under the Connection category:

Specify the connection method; this can be one of XPS Fast (for connecting to the XPS framework directly), HPL (for connecting to HPL servers), or Native (for connecting to any version release 7.x and above). Optionally specify the name of the database you are connecting to. If you have specified the XPS Fast or HPL Connection Method, specify the name of the server hosting Informix XPS (by default DataStage will take this from the INFORMIXSERVER environment variable, see "Accessing Informix Databases" on page 15-2).

Ensure column meta data has been specified for the write.

Reading an Informix Database


In the Output Link Properties Tab, under the Source category:

Choose a Read Method. This is Table by default, which reads directly from a table, but you can also choose to read using auto-generated SQL or user-generated SQL. Specify the table to be read. If using a Read Method of user-generated SQL, specify the SELECT SQL statement to use. DataStage provides the autogenerated statement as a basis, which you can edit as required.

Under the Connection category:

Specify the connection method; this can be one of XPS Fast (for connecting to the XPS framework directly), HPL (for connecting to HPL servers), or Native (for connecting to any version release 7.x and above). Optionally specify the name of the database you are connecting to.


If you have specified the XPS Fast or HPL Connection Method, specify the name of the server hosting Informix XPS (by default DataStage will take this from the INFORMIXSERVER environment variable see "Accessing Informix Databases" on page 15-2).

Ensure column meta data has been specified for the read.

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The execution mode depends on the type of operation the stage is performing.

Writing to an XPS database using the XPS Fast connection method is always parallel, and cannot be changed.
Writing to a database using the HPL connection method is always sequential and cannot be changed.
Writing to a database using the Native connection method is always sequential and cannot be changed.
Reading a database using the HPL connection method is always sequential and cannot be changed.
The execution mode for reading an XPS database depends on the setting of the Connection Method and Partition Table properties.

Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. You can select Set or Clear. If you select Set, read operations will request that the next stage preserves the partitioning as is (it is ignored for write operations).
Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.


Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about how the Informix Enterprise Stage writes data to an Informix database. The stage can have only one input link writing to one table. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and where. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Target/Write Mode: Values = Append/Create/Replace/Truncate; Default = Append; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Target/Table: Values = Table Name; Default = N/A; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Connection/Connection Method: Values = XPS Fast/HPL/Native; Default = XPS Fast; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Connection/Remote Server: Values = True/False; Default = False; Mandatory? = Y (if Connection Method = Native); Repeats? = N; Dependent of = N/A
Connection/User: Values = User id; Default = N/A; Mandatory? = Y (if Connection Method = Native and Remote Server = True); Repeats? = N; Dependent of = N/A
Connection/Password: Values = Password; Default = N/A; Mandatory? = Y (if Connection Method = Native and Remote Server = True); Repeats? = N; Dependent of = N/A
Connection/Database: Values = Database Name; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Connection/Server: Values = Server Name; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Close Command: Values = Close Command; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Open Command: Values = Open Command; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Silently Drop Columns Not in Table: Values = True/False; Default = False; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Default String Length: Values = String Length; Default = 32; Mandatory? = N; Repeats? = N; Dependent of = N/A

Target Category
Write Mode Select from the following: Append. Appends new records to the table. The database user who writes in this mode must have Resource privileges. This is the default mode.


Create. Creates a new table. The database user who writes in this mode must have Resource privileges. The stage returns an error if the table already exists. Replace. Deletes the existing table and creates a new one in its place. The database user who writes in this mode must have Resource privileges. Note that you cannot create or replace a table that has primary keys; you should not specify primary keys in your meta data. Truncate. Retains the table attributes but discards existing records and appends new ones. The stage will run more slowly in this mode if the user does not have Resource privileges. Table Specify the name of the Informix table to write to. It has a dependent property: Select List Specifies a list that determines which columns are written. If you do not supply the list, the stage writes to all columns.

Connection Category
Connection Method Specify the method to use to connect to the Informix database. Choose from: XPS fast. Use this to connect to an Informix XPS (8.x) database. DataStage connects directly to the XPS framework. HPL. Use this to connect to Informix servers (7.x, 9.x) using the High Performance Loader (HPL). Native. Use this to connect to any version of Informix (7.x, 8.x, or 9.x) using native interfaces. Remote Server This option appears if you select the Native connection method. It is False by default. If you select True, the Password and User options appear, allowing you to specify authentication details for the remote server. User This is only available for a Connection Method of Native with Remote Server set to true. Specify the user id for connecting to the remote database.

Password This is only available for a Connection Method of Native with Remote Server set to true. Specify the password for connecting to the remote database with the user id specified by User. The password is encrypted. Database Optionally specify the name of the Informix database containing the table specified by the Table property. Server This is only available with a Connection Method of XPS Fast or HPL. Specify the name of an Informix XPS server.

Option Category
Close Command Specify an INFORMIX SQL statement to be parsed and executed by Informix on all processing nodes after the table has been populated. Open Command Specify an INFORMIX SQL statement to be parsed and executed by Informix on all processing nodes before opening the table. Silently Drop Columns Not in Table Use this property to cause the stage to drop, with a warning, all input columns that do not correspond to the columns of an existing table. If you do not specify drop, an unmatched column generates an error and the associated step terminates. Default String Length Set the default length of string columns. If you do not specify a length, the default is 32 bytes. You can specify a length up to 255 bytes.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the Informix database. It also allows you to specify that the data should be sorted before being written. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of

current and preceding stages and how many nodes are specified in the Configuration file. If the stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Informix Enterprise Stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available:


(Auto). This is the default collection method for Informix Enterprise Stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the database. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.


Outputs Page
The Outputs page allows you to specify details about how the Informix Enterprise Stage reads data from an Informix database. The stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. Details about Informix Enterprise Stage properties are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Output Link Properties Tab


The Properties tab allows you to specify properties for the output link. These dictate how incoming data is read and from what table. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Source/Read Method: Values = Table/Auto-generated SQL/User-defined SQL; Default = Table; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Source/Table: Values = Table Name; Default = N/A; Mandatory? = Y (if Read Method = Table or Auto-generated SQL); Repeats? = N; Dependent of = N/A
Source/Select List: Values = List; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = Table
Source/Where Clause: Values = Filter; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = Table
Source/Partition Table: Values = Table; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = Table
Source/Query: Values = SQL query; Default = N/A; Mandatory? = Y (if Read Method = User-defined SQL or Auto-generated SQL); Repeats? = N; Dependent of = N/A

Connection/Connection Method: Values = XPS Fast/HPL/Native; Default = XPS Fast; Mandatory? = Y; Repeats? = N; Dependent of = N/A
Connection/Remote Server: Values = True/False; Default = False; Mandatory? = Y (if Connection Method = Native); Repeats? = N; Dependent of = N/A
Connection/User: Values = User id; Default = N/A; Mandatory? = Y (if Connection Method = Native and Remote Server = True); Repeats? = N; Dependent of = N/A
Connection/Password: Values = Password; Default = N/A; Mandatory? = Y (if Connection Method = Native and Remote Server = True); Repeats? = N; Dependent of = N/A
Connection/Database: Values = Database Name; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Connection/Server: Values = Server Name; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Close Command: Values = String; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A
Options/Open Command: Values = String; Default = N/A; Mandatory? = N; Repeats? = N; Dependent of = N/A

Source Category
Read Method
Select Table to use the Table property to specify the read (this is the default). Select Auto-generated SQL to have DataStage automatically generate an SQL query based on the columns you have defined and the table you specify in the Table property. Select User-defined SQL to define your own query.
Table
Specify the name of the Informix table to read from. The table must exist. You can prefix the table name with a table owner in the form: table_owner.table_name.

Where Clause
Specify selection criteria to be used as part of an SQL statement's WHERE clause, to specify the rows of the table to include in or exclude from the data set.
Select List
Specifies a list that determines which columns are read. If you do not supply the list, the stage reads all columns. Do not include formatting characters in the list.
Partition Table
Specify this property if the table is fragmented to improve performance by creating one instance of the stage per table fragment. If the table is fragmented across nodes, this property creates one instance of the stage per fragment per node. If the table is fragmented and you do not specify this option, the stage nonetheless functions successfully, if more slowly. You must have Resource privilege to invoke this property. This property is only available for Connection Methods of XPS Fast and Native.
These dependent properties are only available when you have specified a Read Method of Table rather than Auto-generated SQL.
Query
This property is used to contain the SQL query when you choose a Read Method of User-defined SQL or Auto-generated SQL. If you are using Auto-generated SQL you must select a table and specify some column definitions to have DataStage generate the query.
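For example, to read just three columns of a hypothetical customer table and restrict the rows returned, the properties might be set as follows (the owner, table, and column names here are purely illustrative, not defaults of the stage):
Table: billing.customer
Select List: customer_id, customer_name, balance
Where Clause: balance > 0 AND region = 'NE'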

Connection Category
Connection Method
Specify the method to use to connect to the Informix database. Choose from:
XPS Fast. Use this to connect to an Informix XPS (8.x) database. DataStage connects directly to the XPS framework.
HPL. Use this to connect to Informix servers (7.x, 9.x) using the High Performance Loader (HPL).
Native. Use this to connect to any version of Informix (7.x, 8.x, or 9.x) using native interfaces.

Remote Server
This option appears if you select the Native connection method. It is False by default. If you select True, the Password and User options appear, allowing you to specify authentication details for the remote server.
User
This is only available for a Connection Method of Native with Remote Server set to True. Specify the user id for connecting to the remote database.
Password
This is only available for a Connection Method of Native with Remote Server set to True. Specify the password for connecting to the remote database with the user id specified by User. The password is encrypted.
Database
The name of the Informix database.
Server
This is only available with a Connection Method of XPS Fast or HPL. The name of the Informix XPS server.

Options Category
Close Command
Optionally specify an INFORMIX SQL statement to be parsed and executed on all processing nodes after the table selection or query is completed.
Open Command
Optionally specify an INFORMIX SQL statement to be parsed and executed by the database on all processing nodes before the read query is prepared and executed.
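As an illustration only (the statements and the audit table name below are assumptions, not requirements of the stage), an Open Command could relax the isolation level before the read, and a Close Command could record that the extract completed:
Open Command: SET ISOLATION TO DIRTY READ
Close Command: INSERT INTO etl_audit (event) VALUES ('informix read complete')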

16
Transformer Stage
The Transformer stage is a processing stage. It appears under the processing category in the tool palette. Transformer stages allow you to create transformations to apply to your data. These transformations can be simple or complex and can be applied to individual columns in your data. Transformations are specified using a set of functions. Details of available functions are given in Appendix B. A Transformer stage can have a single input and any number of outputs. It can also have a reject link that takes any rows which have not been written to any of the output links because of a write failure or expression evaluation failure.

Unlike most of the other stages in a Parallel job, the Transformer stage has its own user interface. It does not use the generic interface as described in Chapter 3. When you edit a Transformer stage, the Transformer Editor appears. An example Transformer stage is shown below. The left pane

represents input data and the right pane, output data. In this example, the Transformer stage has a single input and output link and meta data has been defined for both.

Must Do's
This section specifies the minimum steps to take to get a Transformer stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are as you get familiar with the product.
In the left pane:
Ensure that you have column meta data defined.
In the right pane:
Ensure that you have column meta data defined for each of the output links. The easiest way to do this is to drag columns across from the input link.
Define the derivation for each of your output columns. You can leave this as a straight mapping from an input column, or explicitly define an expression to transform the data before it is output.

Optionally specify a constraint for each output link. This is an expression which input rows must satisfy before they are output on a link. Rows that are not output on any of the links can be output on the otherwise link.
Optionally specify one or more stage variables. This provides a method of defining expressions which can be reused in your output column derivations (stage variables are only visible within the stage).

Transformer Editor Components


The Transformer Editor has the following components.

Toolbar
The Transformer toolbar contains the following buttons:
stage properties, constraints, show all or selected relations, show/hide stage variables, column auto-match, find/replace, cut, copy, paste, load column definition, save column definition, input link execution order, output link execution order

Link Area
The top area displays links to and from the Transformer stage, showing their columns and the relationships between them. The link area is where all column definitions and stage variables are defined. The link area is divided into two panes; you can drag the splitter bar between them to resize the panes relative to one another. There is also a horizontal scroll bar, allowing you to scroll the view left or right. The left pane shows the input link, the right pane shows output links. Output columns that have no derivation defined are shown in red. Within the Transformer Editor, a single link may be selected at any one time. When selected, the link's title bar is highlighted, and arrowheads indicate any selected columns within that link.

Meta Data Area


The bottom area shows the column meta data for input and output links. Again this area is divided into two panes: the left showing input link meta data and the right showing output link meta data. The meta data for each link is shown in a grid contained within a tabbed page. Click the tab to bring the required link to the front. That link is also selected in the link area. If you select a link in the link area, its meta data tab is brought to the front automatically. You can edit the grids to change the column meta data on any of the links. You can also add and delete meta data. As with column meta data grids on other stage editors, edit row in the context menu allows editing of the full meta data definitions (see "Columns Tab" on page 3-26).

Shortcut Menus
The Transformer Editor shortcut menus are displayed by right-clicking the links in the links area. There are slightly different menus, depending on whether you right-click an input link, an output link, or a stage variable. The input link menu offers you operations on input columns, the output link menu offers you operations on output columns and their derivations, and the stage variable menu offers you operations on stage variables.
The shortcut menu enables you to:
Open the Stage Properties dialog box in order to specify stage or link properties.
Open the Constraints dialog box to specify a constraint (only available for output links).
Open the Column Auto Match dialog box.
Display the Find/Replace dialog box.
Display the Select dialog box.
Edit, validate, or clear a derivation, or stage variable.
Edit several derivations in one operation.
Append a new column or stage variable to the selected link.
Select all columns on a link.
Insert or delete columns or stage variables.

Cut, copy, and paste a column or a key expression or a derivation or stage variable.
If you display the menu from the links area background, you can:
Open the Stage Properties dialog box in order to specify stage or link properties.
Open the Constraints dialog box in order to specify a constraint for the selected output link.
Open the Link Execution Order dialog box in order to specify the order in which links should be processed.
Toggle between viewing link relations for all links, or for the selected link only.
Toggle between displaying stage variables and hiding them.
Right-clicking in the meta data area of the Transformer Editor opens the standard grid editing shortcut menus.

Transformer Stage Basic Concepts


When you first edit a Transformer stage, it is likely that you will have already defined what data is input to the stage on the input link. You will use the Transformer Editor to define the data that will be output by the stage and how it will be transformed. (You can define input data using the Transformer Editor if required.) This section explains some of the basic concepts of using a Transformer stage.

Input Link
The input data source is joined to the Transformer stage via the input link.

Output Links
You can have any number of output links from your Transformer stage. You may want to pass some data straight through the Transformer stage unaltered, but it's likely that you'll want to transform data from some input columns before outputting it from the Transformer stage. You can specify such an operation by entering a transform expression. The source of an output link column is defined in that column's Derivation cell within the Transformer Editor. You can use the

Expression Editor to enter expressions in this cell. You can also simply drag an input column to an output column's Derivation cell, to pass the data straight through the Transformer stage.
In addition to specifying derivation details for individual output columns, you can also specify constraints that operate on entire output links. A constraint is an expression that specifies criteria that data must meet before it can be passed to the output link. You can also specify a constraint otherwise link, which is an output link that carries all the data not output on other links, that is, columns that have not met the criteria.
Each output link is processed in turn. If the constraint expression evaluates to TRUE for an input row, the data row is output on that link. Conversely, if a constraint expression evaluates to FALSE for an input row, the data row is not output on that link. Constraint expressions on different links are independent. If you have more than one output link, an input row may result in a data row being output from some, none, or all of the output links.
For example, if you consider the data that comes from a paint shop, it could include information about any number of different colors. If you want to separate the colors into different files, you would set up different constraints. You could output the information about green and blue paint on LinkA, red and yellow paint on LinkB, and black paint on LinkC. When an input row contains information about yellow paint, the LinkA constraint expression evaluates to FALSE and the row is not output on LinkA. However, the input data does satisfy the constraint criterion for LinkB and the row is output on LinkB. If the input data contains information about white paint, this does not satisfy any constraint and the data row is not output on Links A, B or C, but will be output on the otherwise link. The otherwise link is used to route data to a table or file that is a catch-all for rows that are not output on any other link. The table or file containing these rows is represented by another stage in the job design.
You can also specify another output link which takes rows that have not been written to any other links because of write failure or expression evaluation failure. This is specified outside the stage by adding a link and converting it to a reject link using the shortcut menu. This link is not shown in the Transformer meta data grid, and derives its meta data from the input link. Its column values are those in the input row that failed to be written.
If you have enabled Runtime Column Propagation for an output link (see "Outputs Page" on page 16-34) you do not have to specify meta data for that link.
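As a sketch of the paint shop example (the link and column names are hypothetical, chosen only to show the shape of the constraint expressions):
LinkA constraint: DSLink1.Colour = "Green" OR DSLink1.Colour = "Blue"
LinkB constraint: DSLink1.Colour = "Red" OR DSLink1.Colour = "Yellow"
LinkC constraint: DSLink1.Colour = "Black"
A row describing white paint fails all three constraints, so it is written only to the link nominated as the otherwise link.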
Editing Transformer Stages


The Transformer Editor enables you to perform the following operations on a Transformer stage:
Create new columns on a link
Delete columns from within a link
Move columns within a link
Edit column meta data
Define output column derivations
Define link constraints and handle otherwise links
Specify the order in which links are processed
Define local stage variables

Using Drag and Drop


Many of the Transformer stage edits can be made simpler by using the Transformer Editor's drag and drop functionality. You can drag columns from any link to any other link. Common uses are:
Copying input columns to output links
Moving columns within a link
Copying derivations in output links
To use drag and drop:
1 Click the source cell to select it.
2 Click the selected cell again and, without releasing the mouse button, drag the mouse pointer to the desired location within the target link. An insert point appears on the target link to indicate where the new cell will go.
3 Release the mouse button to drop the selected cell.

You can drag and drop multiple columns, key expressions, or derivations. Use the standard Explorer keys when selecting the source column cells, then proceed as for a single cell. You can drag and drop the full column set by dragging the link title. You can add a column to the end of an existing derivation by holding down the Ctrl key as you drag the column. The drag and drop insert point is shown below:

Find and Replace Facilities


If you are working on a complex job where several links, each containing several columns, go in and out of the Transformer stage, you can use the find/replace column facility to help locate a particular column or expression and change it.
The find/replace facility enables you to:
Find and replace a column name
Find and replace expression text
Find the next empty expression
Find the next expression that contains an error
To use the find/replace facilities, do one of the following:
Click the find/replace button on the toolbar
Choose find/replace from the link shortcut menu
Type Ctrl-F
The Find and Replace dialog box appears. It has three tabs:
Expression Text. Allows you to locate the occurrence of a particular string within an expression, and replace it if required. You can search up or down, and choose to match case, match whole words, or neither. You can also choose to replace all occurrences of the string within an expression.
Column Names. Allows you to find a particular column and rename it if required. You can search up or down, and choose to match case, match the whole word, or neither.
Expression Types. Allows you to find the next empty expression or the next expression that contains an error. You can also press Ctrl-M to find the next empty expression or Ctrl-N to find the next erroneous expression.
Note The find and replace results are shown in the color specified in Tools > Options.

Press F3 to repeat the last search you made without opening the Find and Replace dialog box.

Select Facilities
If you are working on a complex job where several links, each containing several columns, go in and out of the Transformer stage, you can use the select column facility to select multiple columns. This facility is also available in the Mapping tabs of certain Parallel job stages.
The select facility enables you to:
Select all columns/stage variables whose expressions contain text that matches the text specified.
Select all columns/stage variables whose name contains the text specified (and, optionally, matches a specified type).
Select all columns/stage variables with a certain data type.
Select all columns with missing or invalid expressions.
To use the select facilities, choose Select from the link shortcut menu. The Select dialog box appears. It has three tabs:
Expression Text. This tab allows you to select all columns/stage variables whose expressions contain text that matches the text specified. The text specified is a simple text match, taking into account the Match case setting.
Column Names. The Column Names tab allows you to select all columns/stage variables whose Name contains the text specified. There is an additional Data Type drop down list, that will limit the columns selected to those with that data type. You can use the Data Type drop down list on its own to select all columns of a certain data type. For example, all string columns can be selected by leaving the text field blank, and selecting String as the data type. The data types in the list are generic data types, where each of the column SQL data types belong to one of these generic types.
Expression Types. The Expression Types tab allows you to select all columns with either empty expressions or invalid expressions.

Creating and Deleting Columns


You can create columns on links to the Transformer stage using any of the following methods:
Select the link, then click the load column definition button in the toolbar to open the standard load columns dialog box.
Use drag and drop or copy and paste functionality to create a new column by copying from an existing column on another link.
Use the shortcut menus to create a new column definition.
Edit the grids in the link's meta data tab to insert a new column.
When copying columns, a new column is created with the same meta data as the column it was copied from.
To delete a column from within the Transformer Editor, select the column you want to delete and click the cut button or choose Delete Column from the shortcut menu.

Moving Columns Within a Link


You can move columns within a link using either drag and drop or cut and paste. Select the required column, then drag it to its new location, or cut it and paste it in its new location.

Editing Column Meta Data


You can edit column meta data from within the grid in the bottom of the Transformer Editor. Select the tab for the link meta data that you want to edit, then use the standard DataStage edit grid controls. The meta data shown does not include column derivations since these are edited in the links area.

Defining Output Column Derivations


You can define the derivation of output columns from within the Transformer Editor in five ways:
If you require a new output column to be directly derived from an input column, with no transformations performed, then you can use drag and drop or copy and paste to copy an input column to an output link. The output columns will have the same names as the input columns from which they were derived.
If the output column already exists, you can drag or copy an input column to the output column's Derivation field. This specifies that the column is directly derived from an input column, with no transformations performed.
You can use the column auto-match facility to automatically specify that output columns are derived from their matching input columns.

You may need one output link column derivation to be the same as another output link column derivation. In this case you can use drag and drop or copy and paste to copy the derivation cell from one column to another.
In many cases you will need to transform data before deriving an output column from it. For these purposes you can use the Expression Editor. To display the Expression Editor, double-click on the required output link column Derivation cell. (You can also invoke the Expression Editor using the shortcut menu or the shortcut keys.)
Note To access a vector element in a column derivation, you need to use an expression containing the vector function; see "Vector Function" on page B-11.

If a derivation is displayed in red (or the color defined in Tools > Options), it means that the Transformer Editor considers it incorrect. Once an output link column has a derivation defined that contains any input link columns, then a relationship line is drawn between the input column and the output column, as shown in the following example. This is a simple example; there can be multiple relationship lines either in or out of columns. You can choose whether to view the relationships for all links, or just the relationships for the selected links, using the button in the toolbar.

Column Auto-Match Facility


This time-saving feature allows you to automatically set columns on an output link to be derived from matching columns on an input link. Using this feature you can fill in all the output link derivations to route data from corresponding input columns, then go back and edit individual output link columns where you want a different derivation. To use this facility:
1 Do one of the following:
Click the Auto-match button in the Transformer Editor toolbar.
Choose Auto-match from the input link header or output link header shortcut menu.

The Column Auto-Match dialog box appears:

2 Choose the output link that you want to match columns with the input link from the drop down list.
3 Click Location match or Name match from the Match type area. If you choose Location match, this will set output column derivations to the input link columns in the equivalent positions. It starts with the first input link column going to the first output link column, and works its way down until there are no more input columns left. If you choose Name match, you need to specify further information for the input and output columns as follows:

Input columns:
Match all columns or Match selected columns. Choose one of these to specify whether all input link columns should be matched, or only those currently selected on the input link.
Ignore prefix. Allows you to optionally specify characters at the front of the column name that should be ignored during the matching procedure.
Ignore suffix. Allows you to optionally specify characters at the end of the column name that should be ignored during the matching procedure.

Output columns:
Ignore prefix. Allows you to optionally specify characters at the front of the column name that should be ignored during the matching procedure.

Ignore suffix. Allows you to optionally specify characters at the end of the column name that should be ignored during the matching procedure.

Ignore case. Select this check box to specify that case should be ignored when matching names. The setting of this also affects the Ignore prefix and Ignore suffix settings. For example, if you specify that the prefix IP will be ignored, and turn Ignore case on, then both IP and ip will be ignored.

4 Click OK to proceed with the auto-matching.

Note Auto-matching does not take into account any data type incompatibility between matched columns; the derivations are set regardless.

Editing Multiple Derivations


You can make edits across several output column or stage variable derivations by choosing Derivation Substitution from the shortcut menu. This opens the Expression Substitution dialog box. The Expression Substitution dialog box allows you to make the same change to the expressions of all the currently selected columns within a link. For example, if you wanted to add a call to the trim() function around all the string output column expressions in a link, you could do this in two steps. First, use the Select dialog to select all the string output columns. Then use the Expression Substitution dialog to apply a trim() call around each of the existing expression values in those selected columns. You are offered a choice between Whole expression substitution and Part of expression substitution.

Whole Expression
With this option the whole existing expression for each column is replaced by the replacement value specified. This replacement value can be a completely new value, but will usually be a value based on the original expression value. When specifying the replacement value, the existing value of the column's expression can be included in this new value by including "$1". This can be included any number of times. For example, when adding a trim() call around each expression of the currently selected column set, having selected the required columns, you would:

1 Select the Whole expression option.
2 Enter a replacement value of:
trim($1)
3 Click OK.

Where a column's original expression was:


DSLink3.col1

This will be replaced by:


trim(DSLink3.col1)

This is applied to the expressions in each of the selected columns. If you need to include the actual text $1 in your expression, enter it as $$1.

Part of Expression
With this option, only part of each selected expression is replaced rather than the whole expression. The part of the expression to be replaced is specified by a Regular Expression match. It is possible that more than one part of an expression string could match the Regular Expression specified. If Replace all occurrences is checked, then each occurrence of a match will be updated with the replacement value specified. If it is not checked, then just the first occurrence is replaced. When replacing part of an expression, the replacement value specified can include that part of the original expression being replaced. In order to do this, the Regular Expression specified must have round brackets around its value. "$1" in the replacement value will then represent that matched text. If the Regular Expression is not surrounded by round brackets, then "$1" will simply be the text "$1". For complex Regular Expression usage, subsets of the Regular Expression text can be included in round brackets rather than the whole text. In this case, the entire matched part of the original expression is still replaced, but "$1", "$2", etc. can be used to refer to each matched bracketed part of the Regular Expression specified. The following is an example of the Part of expression replacement. Suppose a selected set of columns have derivations that use input columns from DSLink3. For example, two of these derivations could be:
DSLink3.OrderCount + 1
If (DSLink3.Total > 0) Then DSLink3.Total Else -1

You may want to protect the usage of these input columns from null values, and use a zero value instead of the null. To do this:

1 Select the columns you want to substitute expressions for.
2 Select the Part of expression option.
3 Specify a Regular Expression value of:
(DSLink3\.[a-z,A-Z,0-9]*)

This will match strings that contain DSLink3., followed by any number of alphabetic characters or digits. (This assumes that column names in this case are made up of alphabetic characters and digits.) The round brackets around the whole expression mean that $1 will represent the whole matched text in the replacement value.
4 Specify a replacement value of:
NullToZero($1)
This replaces just the matched substrings in the original expression with those same substrings, but surrounded by the NullToZero call.
5 Click OK to apply this to all the selected column derivations.

From the examples above:


DSLink3.OrderCount + 1

would become
NullToZero(DSLink3.OrderCount) + 1

and
If (DSLink3.Total > 0) Then DSLink3.Total Else -1

would become:
If (NullToZero(DSLink3.Total) > 0) Then DSLink3.Total Else -1

If the Replace all occurrences option is selected, the second expression will become:
If (NullToZero(DSLink3.Total) > 0) Then NullToZero(DSLink3.Total) Else -1

The replacement value can be any form of expression string. For example in the case above, the replacement value could have been:
(If (StageVar1 > 50000) Then $1 Else ($1 + 100))

In the first case above, the expression


DSLink3.OrderCount + 1

would become:
(If (StageVar1 > 50000) Then DSLink3.OrderCount Else (DSLink3.OrderCount + 100)) + 1

Handling Null Values in Input Columns


If you use input columns in an output column expression, be aware that a null value in that input column will cause the row to be dropped or, if a reject link has been defined, rejected. This applies where:
An input column is used in an output column derivation expression (for example, an expression like DSLink4.col1 + 1).
An input column is used in an output column constraint.
An input column is used in a stage variable derivation.
It does not apply where an output column is mapped directly from an input column, with a straight assignment expression.
If you need to be able to handle nulls in these situations, you should use the null handling functions described in Appendix B. For example, you could enter an output column derivation expression including the expression:
1 + NullToZero(InLink.Col1)

This would check the input column to see if it contains a null, and if it did, replace it with 0 (which is added to 1). Otherwise the value the column contains is added to 1.
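An equivalent derivation written with an explicit null test (assuming the IsNull null handling function described in Appendix B, and the same hypothetical InLink.Col1 column) would be:
If IsNull(InLink.Col1) Then 1 Else InLink.Col1 + 1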

Defining Constraints and Handling Otherwise Links


You can define limits for output data by specifying a constraint. Constraints are expressions and you can specify a constraint for each output link from a Transformer stage. You can also specify that a particular link is to act as an otherwise link and catch those rows that have failed to satisfy the constraints on all other output links.
To define a constraint or specify an otherwise link, do one of the following:
Select an output link and click the constraints button.
Double-click the output link's constraint entry field.
Choose Constraints from the background or header shortcut menus.
A dialog box appears which allows you either to define constraints for any of the Transformer output links or to define a link as an otherwise link.
Define a constraint by entering an expression in the Constraint field for that link. Once you have done this, any constraints will appear below the link's title bar in the Transformer Editor. This constraint expression will then be checked against the row data at runtime. If the
data does not satisfy the constraint, the row will not be written to that link.
It is also possible to define a link which can be used to catch those rows which have not satisfied the constraints on any previous links. A constraint otherwise link can be defined by:
Clicking on the Otherwise/Log field so a tick appears and leaving the Constraint fields blank. This will catch any rows that have failed to meet constraints on all the previous output links.
Setting the constraint to OTHERWISE. This will be set whenever a row is rejected on a link because the row fails to match a constraint. OTHERWISE is cleared by any output link that accepts the row. The otherwise link must occur after the output links in link order so it will catch rows that have failed to meet the constraints of all the output links. If it is not last, a row that satisfies a constraint on a later link may be sent down the otherwise link as well as down that later link.
Clicking on the Otherwise/Log field so a tick appears and defining a Constraint. This will result in the number of rows written to that link (that is, rows which satisfy the constraint) being recorded in the job log as a warning message.
Note You can also specify a reject link which will catch rows that have not been written on any output links due to a write error or null expression error. Define this outside the Transformer stage by adding a link and using the shortcut menu to convert it to a reject link.


Specifying Link Order


You can specify the order in which output links process a row. The initial order of the links is the order in which they are added to the stage. To reorder the links:
1 Do one of the following:
Click the output link execution order button on the Transformer Editor toolbar.
Choose output link reorder from the background shortcut menu.

The Transformer Stage Properties dialog box appears with the Link Ordering tab of the Stage page uppermost:

2 Use the arrow buttons to rearrange the list of links in the execution order required.
3 When you are happy with the order, click OK.

Defining Local Stage Variables


You can declare and use your own variables within a Transformer stage. Such variables are accessible only from the Transformer stage in which they are declared. They can be used as follows:
They can be assigned values by expressions.


They can be used in expressions which define an output column derivation.
Expressions evaluating a variable can include other variables or the variable being evaluated itself.
Any stage variables you declare are shown in a table in the right pane of the links area. The table looks similar to an output link. You can display or hide the table by clicking the Stage Variable button in the Transformer toolbar or choosing Stage Variable from the background shortcut menu.
Note Stage variables are not shown in the output link meta data area at the bottom of the right pane.

The table lists the stage variables together with the expressions used to derive their values. Link lines join the stage variables with input columns used in the expressions. Links from the right side of the table link the variables to the output columns that use them. To declare a stage variable:
1 Do one of the following:
Select Insert New Stage Variable from the stage variable shortcut menu. A new variable is added to the stage variables table in the links pane. The variable is given the default name

StageVar and default data type VarChar (255). You can edit these properties using the Transformer Stage Properties dialog box, as described in the next step.

Click the Stage Properties button on the Transformer toolbar.
Select Stage Properties from the background shortcut menu.
Select Stage Variable Properties from the stage variable shortcut menu.

The Transformer Stage Properties dialog box appears:

2 Using the grid on the Variables page, enter the variable name, initial value, SQL type, extended information (if the variable contains Unicode data), precision, scale, and an optional description. Variable names must begin with an alphabetic character (a-z, A-Z) and can only contain alphanumeric characters (a-z, A-Z, 0-9).
3 Click OK. The new variable appears in the stage variable table in the links pane.

You perform most of the same operations on a stage variable as you can on an output column (see page 16-10). A shortcut menu offers the same commands. You cannot, however, paste a stage variable as a new column, or a column as a new stage variable.


The DataStage Expression Editor


The DataStage Expression Editor helps you to enter correct expressions when you edit Transformer stages. The Expression Editor can:
Facilitate the entry of expression elements
Complete the names of frequently used variables
Validate the expression
The Expression Editor can be opened from:
Output link Derivation cells
Stage variable Derivation cells
Constraint dialog box

Expression Format
The format of an expression is as follows:
KEY:
something_like_this is a token
something_in_italics is a terminal, i.e., doesn't break down any further
| is a choice between tokens
[ ] is an optional part of the construction
"XXX" is a literal token (i.e., use XXX not including the quotes)
=================================================
expression ::= function_call |
               variable_name |
               other_name |
               constant |
               unary_expression |
               binary_expression |
               if_then_else_expression |
               substring_expression |
               "(" expression ")"
function_call ::= function_name "(" [argument_list] ")"
argument_list ::= expression | expression "," argument_list
function_name ::= name of a built-in function |
                  name of a user-defined_function
variable_name ::= job_parameter name |
                  stage_variable_name |
                  link_variable name
other_name ::= name of a built-in macro, system variable, etc.
constant ::= numeric_constant | string_constant
numeric_constant ::= ["+" | "-"] digits ["." [digits]] ["E" | "e" ["+" | "-"] digits]

string_constant ::= "'" [characters] "'" |
                    """ [characters] """ |
                    "\" [characters] "\"
unary_expression ::= unary_operator expression
unary_operator ::= "+" | "-"
binary_expression ::= expression binary_operator expression
binary_operator ::= arithmetic_operator |
                    concatenation_operator |
                    matches_operator |
                    relational_operator |
                    logical_operator
arithmetic_operator ::= "+" | "-" | "*" | "/" | "^"
concatenation_operator ::= ":"
relational_operator ::= "=" | "EQ" |
                        "<>" | "#" | "NE" |
                        ">" | "GT" |
                        ">=" | "=>" | "GE" |
                        "<" | "LT" |
                        "<=" | "=<" | "LE"
logical_operator ::= "AND" | "OR"
if_then_else_expression ::= "IF" expression "THEN" expression "ELSE" expression
substring_expression ::= expression "[" [expression ["," expression]] "]"
field_expression ::= expression "[" expression "," expression "," expression "]"
/* That is, always 3 args */

Note keywords like "AND" or "IF" or "EQ" may be in any case
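As a sketch (the link and column names are hypothetical, and Trim is assumed from the string functions in Appendix B), each of the following is a valid expression under this grammar: a function call, an if-then-else expression, a substring expression, and a concatenation:
Trim(DSLink1.Description)
If DSLink1.Amount >= 0 Then DSLink1.Amount Else -DSLink1.Amount
DSLink1.ProductCode[1,3]
DSLink1.FirstName : " " : DSLink1.LastName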

Entering Expressions
Whenever the insertion point is in an expression box, you can use the Expression Editor to suggest the next element in your expression. Do this by right-clicking the box, or by clicking the Suggest button to the right of the box. This opens the Suggest Operand or Suggest Operator menu. Which menu appears depends on context, i.e., whether you should be entering an operand or an operator as the next expression element. The Functions available from this menu are described in Appendix B. The DS macros are described in "Job Status Macros" in Parallel Job Advanced Developers Guide. You can also specify custom routines for use in the expression editor (see "Working with Mainframe Routines" in DataStage Manager Guide).

Suggest Operand Menu:

Suggest Operator Menu:

Completing Variable Names


The Expression Editor stores variable names. When you enter a variable name you have used before, you can type the first few characters, then press F5. The Expression Editor completes the variable name for you. If you enter the name of the input link followed by a period, for example, DailySales., the Expression Editor displays a list of the column names of the link. If you continue typing, the list selection changes to match what you type. You can also select a column name using the mouse. Enter a selected column name into the expression by pressing Tab or Enter. Press Esc to dismiss the list without selecting a column name.

Validating the Expression


When you have entered an expression in the Transformer Editor, press Enter to validate it. The Expression Editor checks that the syntax is correct and that any variable names used are acceptable to the compiler. If there is an error, a message appears and the element causing the error is highlighted in the expression box. You can either correct the expression or close the Transformer Editor or Transform dialog box. For any expression, selecting Validate from its shortcut menu will also validate it and show any errors in a message box.

Exiting the Expression Editor


You can exit the Expression Editor in the following ways:
Press Esc (which discards changes).
Press Return (which accepts changes).
Click outside the Expression Editor box (which accepts changes).

Configuring the Expression Editor


You can resize the Expression Editor window by dragging. The next time you open the expression editor in the same context (for example, editing output columns) on the same client, it will have the same size. The Expression Editor is configured by editing the Designer options. This allows you to specify how helpful the expression editor is. For more information, see "Specifying Designer Options" in DataStage Designer Guide.

System Variables
DataStage provides a set of variables containing useful system information that you can access from an output derivation or constraint.
Name              Description
@FALSE            The value is replaced with 0.
@TRUE             The value is replaced with 1.
@INROWNUM         Input row counter.
@OUTROWNUM        Output row counter (per link).
@NUMPARTITIONS    The total number of partitions for the stage.
@PARTITIONNUM     The partition number for the particular instance.
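For example, a sketch of an output link constraint that passes only the first ten rows read in each partition, built from the system variables above and the comparison operators in the expression grammar, would be:
@INROWNUM <= 10
Similarly, a constraint of @PARTITIONNUM = 0 would pass only rows processed by the first partition.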

Guide to Using Transformer Expressions and Stage Variables


In order to write efficient Transformer stage derivations, it is useful to understand what items get evaluated and when. The evaluation sequence is as follows:
Evaluate each stage variable initial value

For each input row to process:
    Evaluate each stage variable derivation value, unless the derivation is empty
    For each output link:
        Evaluate each column derivation value
        Write the output record
    Next output link
Next input row

The stage variables and the columns within a link are evaluated in the order in which they are displayed on the parallel job canvas. Similarly, the output links are also evaluated in the order in which they are displayed. From this sequence, it can be seen that there are certain constructs that would be inefficient to include in output column derivations, as they would be evaluated once for every output column that uses them. Such constructs are: Where the same part of an expression is used in multiple column derivations. For example, suppose multiple columns in output links want to use the same substring of an input column, then the following test may appear in a number of output column derivations:
IF (DSLINK1.col1[1,3] = "001") THEN ...

In this case, the evaluation of the substring of DSLINK1.col1[1,3] is repeated for each column that uses it. This can be made more efficient by moving the substring calculation into a stage variable. By doing this, the substring is evaluated just once for every input row. In this case, the stage variable definition for StageVar1 would be:
DSLINK1.col1[1,3]

and each column derivation would start with:


IF (StageVar1 = "001") THEN ...

In fact, this example could be improved further by also moving the string comparison into the stage variable. The stage variable would be:
IF (DSLink1.col1[1,3] = "001") THEN 1 ELSE 0

and each column derivation would start with:


IF (StageVar1) THEN

This reduces both the number of substring functions evaluated and string comparisons made in the Transformer. Where an expression includes calculated constant values.


For example, a column definition may include a function call that returns a constant value, such as:
Str(" ",20)

This returns a string of 20 spaces. In this case, the function would be evaluated every time the column derivation is evaluated. It would be more efficient to calculate the constant value just once for the whole Transformer. This can be achieved using stage variables. This function could be moved into a stage variable derivation; but in this case, the function would still be evaluated once for every input row. The solution here is to move the function evaluation into the initial value of a stage variable. A stage variable can be assigned an initial value from the Stage Properties dialog box Variables tab. In this case, the variable would have its initial value set to:
Str(" ", 20)

You would then leave the derivation of the stage variable on the main Transformer page empty. Any expression that previously used this function would be changed to use the stage variable instead. The initial value of the stage variable is evaluated just once, before any input rows are processed. Then, because the derivation expression of the stage variable is empty, it is not re-evaluated for each input row. Therefore, its value for the whole Transformer processing is unchanged from the initial value. In addition to a function value returning a constant value, another example would be part of an expression such as:
"abc" : "def"

As with the function-call example, this concatenation is repeated every time the column derivation is evaluated. Since the subpart of the expression is actually constant, this constant part of the expression could again be moved into a stage variable, using the initial value setting to perform the concatenation just once. Where an expression requiring a type conversion is used as a constant, or it is used in multiple places. For example, an expression may include something like this:
DSLink1.col1+"1"

In this case, the "1" is a string constant, and so, in order to be able to add it to DSLink1.col1, it must be converted from a string to an

integer each time the expression is evaluated. The solution in this case is just to change the constant from a string to an integer:
DSLink1.col1+1

In this example, if DSLINK1.col1 were a string field, then, again, a conversion would be required every time the expression is evaluated. If this just appeared once in one output column expression, this would be fine. However, if an input column is used in more than one expression, where it requires the same type conversion in each expression, then it would be more efficient to use a stage variable to perform the conversion once. In this case, you would create, for example, an integer stage variable, specify its derivation to be DSLINK1.col1, and then use the stage variable in place of DSLink1.col1, where that conversion would have been required. Note that, when using stage variables to evaluate parts of expressions, the data type of the stage variable should be set correctly for that context. Otherwise, needless conversions are required wherever that variable is used.

Transformer Stage Properties


The Transformer stage has a Properties dialog box which allows you to specify details about how the stage operates. The Transformer Stage Properties dialog box has three pages:
Stage Page. This is used to specify general information about the stage.
Inputs Page. This is where you specify details about the data input to the Transformer stage.
Outputs Page. This is where you specify details about the output links from the Transformer stage.

Stage Page
The Stage page has up to seven tabs:
General. Allows you to enter an optional description of the stage.
Variables. Allows you to set up stage variables for use in the stage.
Advanced. Allows you to specify how the stage executes.
Link Ordering. Allows you to specify the order in which the output links will be processed.
Triggers. Allows you to run certain routines at certain points in the stage's execution.
NLS Locale. Allows you to select a locale other than the project default to determine collating rules.
Build. Allows you to override the default compiler and linker flags for this stage.
The Variables tab is described in "Defining Local Stage Variables" on page 16-18. The Link Ordering tab is described in "Specifying Link Order" on page 16-18.

General Tab
In addition to the Description field, the General page also has an option which lets you control how many rejected row warnings will appear in the job log when you run the job. Whenever a row is rejected because it contains a null value, a warning is written to the job log. Potentially there could be a lot of messages, so this option allows you to set limits. By default, up to 50 messages per partition are allowed, but you can increase or decrease this, or set it to -1 to allow unlimited messages.

Advanced Tab
The Advanced tab is the same as the Advanced tab of the generic stage editor as described in "Advanced Tab" on page 3-12. This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In sequential mode the data is processed by the conductor node.
Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. This is set to Propagate by default; this sets or clears the partitioning in accordance with what the previous stage has set. You can also select Set or Clear. If you select Set, the stage will request that the next stage preserves the partitioning as is.
Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Triggers Tab
The Triggers tab allows you to choose routines to be executed at specific execution points as the transformer stage runs in a job. The execution point is per-instance, i.e., if a job has two transformer stage instances running in parallel, the routine will be called twice, once for each instance. The available execution points are Before-stage and After-stage. At this release, the only available built-in routine is SetCustomSummaryInfo. You can also define custom routines to be executed; to do this you define a C function, make it available in a UNIX shared library, and then define a Parallel routine which calls it (see "Working with Parallel Routines" in DataStage Manager Guide for details on defining a Parallel Routine). Note that the function should not return a value.
SetCustomSummaryInfo is used to collect reporting information. This information is included in any XML reports generated, and can be retrieved using a number of methods:
DSMakeJobReport API function (see "DSMakeJobReport" in Parallel Job Advanced Developers Guide).
The DSJob -Report command line command (see "Generating a Report" in Parallel Job Advanced Developers Guide).
DSJobReport used as an after-job subroutine (see "Job Properties" in DataStage Designer Guide).
Each item of information collected by SetCustomSummaryInfo is stored as a variable name, a description, and a value. These appear as arguments in the Triggers tab grid (variable name in Argument 1, description in Argument 2, value in Argument 3). You can supply values for them via the expression editor. You can use job parameters

and stage variables but you cannot access data that is available only while the stage is running, such as columns.

NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The transformer stage uses this when it is evaluating expressions. For example, it would affect the evaluation of an expression such as if ustring1 > ustring2. Select a locale from the list, or click the arrow

button next to the list to use a job parameter or browse for a collate file.

Build Tab
This tab allows you to override the compiler and linker flags that have been set for the job or project. The flags you specify here will take effect for this stage and this stage alone. The flags available are platform and compiler-dependent.

Inputs Page
The Inputs page allows you to specify details about data coming into the Transformer stage. The Transformer stage can have only one input link. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned. This is the same as the Partitioning tab in the generic stage editor described in "Partitioning Tab" on page 3-20. The Advanced tab allows you to change the default buffering settings for the input link.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected when input to the Transformer stage. It also allows you to specify that the data should be sorted on input.
By default the Transformer stage will attempt to preserve partitioning of incoming data, or use its own partitioning method according to what the previous stage in the job dictates. If the Transformer stage is operating in sequential mode, it will first collect the data before writing it to the file using the default collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
Whether the stage is set to execute in parallel or sequential mode.
Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Transformer stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning type drop-down list. This will override any current partitioning. If the Transformer stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Transformer stage.
Entire. Each file written to receives the entire data set.
16-32 Parallel Job Developers Guide

Transformer Stage

Transformer Stage Properties

Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default method for the Transformer stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen. Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

Parallel Job Developers Guide

16-33

Transformer Stage Properties

Transformer Stage

Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs Page has a General tab which allows you to enter an optional description for each of the output links on the Transformer stage. It also allows you to switch Runtime column propagation on for this link, in which case data will automatically be propagated from the input link without you having to specify meta data for this output link (see "Runtime Column Propagation" on page 2-27). The Advanced tab allows you to change the default buffering settings for the output links.

17
BASIC Transformer Stages
The BASIC Transformer stage is a processing stage. It appears under the processing category in the tool palette in the Transformer shortcut container. The BASIC Transformer stage is similar in appearance and function to the Transformer stage described in Chapter 16. It gives access to BASIC transforms and functions (BASIC is the language supported by the DataStage server engine and available in server jobs). For a description of the BASIC functions available see DataStage Server Job Developers Guide. You can only use BASIC transformer stages on SMP systems (not on MPP or cluster systems).
Note If you encounter a problem when running a job containing a BASIC transformer, you could try increasing the value of the DSIPC_OPEN_TIMEOUT environment variable in the Parallel Operator specific category of the environment variable dialog box in the DataStage Administrator (see "Setting Environment Variables" in DataStage Administrator Guide).

BASIC Transformer stages can have a single input and any number of outputs.

When you edit a Transformer stage, the Transformer Editor appears. An example Transformer stage is shown below. In this example, meta data has been defined for the input and the output links.

Must Dos
This section specifies the minimum steps to take to get a BASIC Transformer stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are when you get familiar with the product.

In the left pane:

Ensure that you have column meta data defined.

In the right pane:

Ensure that you have column meta data defined for each of the output links. The easiest way to do this is to drag columns across from the input link.

Define the derivation for each of your output columns. You can leave this as a straight mapping from an input column, or explicitly define an expression to transform the data before it is output.

Optionally specify a constraint for each output link. This is an expression which input rows must satisfy before they are output on a link. Rows that are not output on any of the links can be output on the otherwise link.

Optionally specify one or more stage variables. This provides a method of defining expressions which can be reused in your output column derivations (stage variables are only visible within the stage).

BASIC Transformer Editor Components


The BASIC Transformer Editor has the following components.

Toolbar
The Transformer toolbar contains the following buttons:
stage properties, constraints, show all or selected relations, show/hide stage variables, column auto-match, find/replace, cut, copy, paste, load column definition, save column definition, input link execution order, and output link execution order.

Link Area
The top area displays links to and from the BASIC Transformer stage, showing their columns and the relationships between them. The link area is where all column definitions and stage variables are defined. The link area is divided into two panes; you can drag the splitter bar between them to resize the panes relative to one another. There is also a horizontal scroll bar, allowing you to scroll the view left or right. The left pane shows the input link, the right pane shows output links. Output columns that have no derivation defined are shown in red. Within the Transformer Editor, a single link may be selected at any one time. When selected, the link's title bar is highlighted, and arrowheads indicate any selected columns.

Meta Data Area


The bottom area shows the column meta data for input and output links. Again this area is divided into two panes: the left showing input link meta data and the right showing output link meta data. The meta data for each link is shown in a grid contained within a tabbed page. Click the tab to bring the required link to the front. That link is also selected in the link area. If you select a link in the link area, its meta data tab is brought to the front automatically. You can edit the grids to change the column meta data on any of the links. You can also add and delete meta data.

Shortcut Menus
The BASIC Transformer Editor shortcut menus are displayed by right-clicking the links in the links area. There are slightly different menus, depending on whether you right-click an input link, an output link, or a stage variable. The input link menu offers you operations on input columns, the output link menu offers you operations on output columns and their derivations, and the stage variable menu offers you operations on stage variables.

The shortcut menu enables you to:

Open the Stage Properties dialog box in order to specify stage or link properties.
Open the Constraints dialog box to specify a constraint (only available for output links).
Open the Column Auto Match dialog box.
Display the Find/Replace dialog box.
Display the Select dialog box.
Edit, validate, or clear a derivation or stage variable.
Edit several derivations in one operation.
Append a new column or stage variable to the selected link.
Select all columns on a link.
Insert or delete columns or stage variables.
Cut, copy, and paste a column or a key expression or a derivation or stage variable.

If you display the menu from the links area background, you can:

Open the Stage Properties dialog box in order to specify stage or link properties.
Open the Constraints dialog box in order to specify a constraint for the selected output link.
Open the Link Execution Order dialog box in order to specify the order in which links should be processed.
Toggle between viewing link relations for all links, or for the selected link only.
Toggle between displaying stage variables and hiding them.

Right-clicking in the meta data area of the Transformer Editor opens the standard grid editing shortcut menus.

BASIC Transformer Stage Basic Concepts


When you first edit a Transformer stage, it is likely that you will have already defined what data is input to the stage on the input links. You will use the Transformer Editor to define the data that will be output by the stage and how it will be transformed. (You can define input data using the Transformer Editor if required.) This section explains some of the basic concepts of using a Transformer stage.

Input Link
The input data source is joined to the BASIC Transformer stage via the input link.

Output Links
You can have any number of output links from your Transformer stage. You may want to pass some data straight through the BASIC Transformer stage unaltered, but it's likely that you'll want to transform data from some input columns before outputting it from the BASIC Transformer stage. You can specify such an operation by entering an expression or by selecting a transform to apply to the data. DataStage has many built-in transforms, or you can define your own custom transforms that are stored in the Repository and can be reused as required.

The source of an output link column is defined in that column's Derivation cell within the Transformer Editor. You can use the Expression Editor to enter expressions or transforms in this cell. You can also simply drag an input column to an output column's Derivation cell, to pass the data straight through the BASIC Transformer stage.

In addition to specifying derivation details for individual output columns, you can also specify constraints that operate on entire output links. A constraint is a BASIC expression that specifies criteria that data must meet before it can be passed to the output link. You can also specify a reject link, which is an output link that carries all the data not output on other links, that is, columns that have not met the criteria.

Each output link is processed in turn. If the constraint expression evaluates to TRUE for an input row, the data row is output on that link. Conversely, if a constraint expression evaluates to FALSE for an input row, the data row is not output on that link. Constraint expressions on different links are independent. If you have more than one output link, an input row may result in a data row being output from some, none, or all of the output links.

For example, if you consider the data that comes from a paint shop, it could include information about any number of different colors. If you want to separate the colors into different files, you would set up different constraints. You could output the information about green and blue paint on LinkA, red and yellow paint on LinkB, and black paint on LinkC.

When an input row contains information about yellow paint, the LinkA constraint expression evaluates to FALSE and the row is not output on LinkA. However, the input data does satisfy the constraint criterion for LinkB and the row is output on LinkB. If the input data contains information about white paint, this does not satisfy any constraint and the data row is not output on Links A, B or C, but will be output on the reject link. The reject link is used to route data to a table or file that is a catch-all for rows that are not output on any other link. The table or file containing these rejects is represented by another stage in the job design.
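A sketch of how the constraints for this example might be written, assuming the input link is called DSLink1 and the color is held in a column called Colour (both names are hypothetical):

LinkA constraint:  DSLink1.Colour = "Green" OR DSLink1.Colour = "Blue"
LinkB constraint:  DSLink1.Colour = "Red" OR DSLink1.Colour = "Yellow"
LinkC constraint:  DSLink1.Colour = "Black"

A row describing white paint satisfies none of these expressions, so it is written only to the reject link.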

Before-Stage and After-Stage Routines


You can specify routines to be executed before or after the stage has processed the data. For example, you might use a before-stage routine to prepare the data before processing starts. You might use an after-stage routine to send an electronic message when the stage has finished.

Editing BASIC Transformer Stages


The Transformer Editor enables you to perform the following operations on a BASIC Transformer stage:

Create new columns on a link
Delete columns from within a link
Move columns within a link
Edit column meta data
Define output column derivations
Specify before- and after-stage subroutines
Define link constraints and handle rejects
Specify the order in which links are processed
Define local stage variables

Using Drag and Drop


Many of the BASIC Transformer stage edits can be made simpler by using the Transformer Editor's drag and drop functionality. You can drag columns from any link to any other link. Common uses are:

Copying input columns to output links
Moving columns within a link
Copying derivations in output links

To use drag and drop:

1  Click the source cell to select it.

2  Click the selected cell again and, without releasing the mouse button, drag the mouse pointer to the desired location within the target link. An insert point appears on the target link to indicate where the new cell will go.

3  Release the mouse button to drop the selected cell.

You can drag and drop multiple columns or derivations. Use the standard Explorer keys when selecting the source column cells, then proceed as for a single cell. You can drag and drop the full column set by dragging the link title.

You can add a column to the end of an existing derivation by holding down the Ctrl key as you drag the column. The drag and drop insert point for creating new columns is shown below:

Find and Replace Facilities


If you are working on a complex job where several links, each containing several columns, go in and out of the BASIC Transformer stage, you can use the find/replace column facility to help locate a particular column or expression and change it.

The find/replace facility enables you to:

Find and replace a column name
Find and replace expression text
Find the next empty expression
Find the next expression that contains an error

To use the find/replace facilities, do one of the following:

Click the find/replace button on the toolbar
Choose find/replace from the link shortcut menu
Type Ctrl-F

The Find and Replace dialog box appears. It has three tabs:

Expression Text. Allows you to locate the occurrence of a particular string within an expression, and replace it if required. You can search up or down, and choose to match case, match whole words, or neither. You can also choose to replace all occurrences of the string within an expression.

Column Names. Allows you to find a particular column and rename it if required. You can search up or down, and choose to match case, match the whole word, or neither.

Expression Types. Allows you to find the next empty expression or the next expression that contains an error. You can also press Ctrl-M to find the next empty expression or Ctrl-N to find the next erroneous expression.
Note The find and replace results are shown in the color specified in Tools > Options.

Press F3 to repeat the last search you made without opening the Find and Replace dialog box.

Select Facilities
If you are working on a complex job where several links, each containing several columns, go in and out of the Transformer stage, you can use the select column facility to select multiple columns. This facility is also available in the Mapping tabs of certain Parallel job stages.

The select facility enables you to:

Select all columns/stage variables whose expressions contain text that matches the text specified.
Select all columns/stage variables whose name contains the text specified (and, optionally, matches a specified type).
Select all columns/stage variables with a certain data type.
Select all columns with missing or invalid expressions.

To use the select facilities, choose Select from the link shortcut menu. The Select dialog box appears. It has three tabs:

Expression Text. The Expression Text tab allows you to select all columns/stage variables whose expressions contain text that matches the text specified. The text specified is a simple text match, taking into account the Match case setting.

Column Names. The Column Names tab allows you to select all columns/stage variables whose Name contains the text specified. There is an additional Data Type drop-down list, that will limit the columns selected to those with that data type. You can use the Data Type drop-down list on its own to select all columns of a certain data type. For example, all string columns can be selected by leaving the text field blank, and selecting String as the data type. The data types in the list are generic data types, where each of the column SQL data types belong to one of these generic types.

Expression Types. The Expression Types tab allows you to select all columns with either empty expressions or invalid expressions.

Creating and Deleting Columns


You can create columns on links to the BASIC Transformer stage using any of the following methods:

Select the link, then click the load column definition button in the toolbar to open the standard load columns dialog box.
Use drag and drop or copy and paste functionality to create a new column by copying from an existing column on another link.
Use the shortcut menus to create a new column definition.
Edit the grids in the link's meta data tab to insert a new column.

When copying columns, a new column is created with the same meta data as the column it was copied from.

To delete a column from within the Transformer Editor, select the column you want to delete and click the cut button or choose Delete Column from the shortcut menu.

Moving Columns Within a Link


You can move columns within a link using either drag and drop or cut and paste. Select the required column, then drag it to its new location, or cut it and paste it in its new location.

Editing Column Meta Data


You can edit column meta data from within the grid in the bottom of the Transformer Editor. Select the tab for the link meta data that you want to edit, then use the standard DataStage edit grid controls. The meta data shown does not include column derivations since these are edited in the links area.

Defining Output Column Derivations


You can define the derivation of output columns from within the Transformer Editor in five ways:

If you require a new output column to be directly derived from an input column, with no transformations performed, then you can use drag and drop or copy and paste to copy an input column to an output link. The output columns will have the same names as the input columns from which they were derived.

If the output column already exists, you can drag or copy an input column to the output column's Derivation field. This specifies that the column is directly derived from an input column, with no transformations performed.

You can use the column auto-match facility to automatically specify that output columns are derived from their matching input columns.

You may need one output link column derivation to be the same as another output link column derivation. In this case you can use drag and drop or copy and paste to copy the derivation cell from one column to another.

In many cases you will need to transform data before deriving an output column from it. For these purposes you can use the Expression Editor. To display the Expression Editor, double-click on the required output link column Derivation cell. (You can also invoke the Expression Editor using the shortcut menu or the shortcut keys.)

If a derivation is displayed in red (or the color defined in Tools > Options), it means that the Transformer Editor considers it incorrect. (In some cases this may simply mean that the derivation does not meet the strict usage pattern rules of the DataStage engine, but will actually function correctly.)

Once an output link column has a derivation defined that contains any input link columns, then a relationship line is drawn between the input column and the output column, as shown in the following example. This is a simple example; there can be multiple relationship lines either in or out of columns. You can choose whether to view the relationships for all links, or just the relationships for the selected links, using the button in the toolbar.
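As an illustration of the kinds of derivation you might enter in the Expression Editor (the link and column names below are hypothetical):

DSLink3.CustomerName
Trim(DSLink3.Surname) : ", " : DSLink3.Forename
If IsNull(DSLink3.Discount) Then 0 Else DSLink3.Discount

The first passes an input column straight through, the second concatenates a trimmed input column, a literal separator, and a second input column, and the third substitutes a default value for nulls before the column is output.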

Column Auto-Match Facility


This time-saving feature allows you to automatically set columns on an output link to be derived from matching columns on an input link. Using this feature you can fill in all the output link derivations to route data from corresponding input columns, then go back and edit individual output link columns where you want a different derivation.

To use this facility:


1  Do one of the following:

   Click the Auto-match button in the Transformer Editor toolbar.
   Choose Auto-match from the input link header or output link header shortcut menu.

   The Column Auto-Match dialog box appears:

2  Choose the input link and output link that you want to match columns for from the drop down lists.

3  Click Location match or Name match from the Match type area.

   If you choose Location match, this will set output column derivations to the input link columns in the equivalent positions. It starts with the first input link column going to the first output link column, and works its way down until there are no more input columns left.

   If you choose Name match, you need to specify further information for the input and output columns as follows:

Input columns:

Match all columns or Match selected columns. Choose one of these to specify whether all input link columns should be matched, or only those currently selected on the input link.

Ignore prefix. Allows you to optionally specify characters at the front of the column name that should be ignored during the matching procedure.

Ignore suffix. Allows you to optionally specify characters at the end of the column name that should be ignored during the matching procedure.

Output columns:

Ignore prefix. Allows you to optionally specify characters at the front of the column name that should be ignored during the matching procedure.

Ignore suffix. Allows you to optionally specify characters at the end of the column name that should be ignored during the matching procedure.

Ignore case. Select this check box to specify that case should be ignored when matching names. The setting of this also affects the Ignore prefix and Ignore suffix settings. For example, if you specify that the prefix IP will be ignored, and turn Ignore case on, then both IP and ip will be ignored.

Click OK to proceed with the auto-matching.

Note Auto-matching does not take into account any data type incompatibility between matched columns; the derivations are set regardless.

Editing Multiple Derivations


You can make edits across several output column or stage variable derivations by choosing Derivation Substitution from the shortcut menu. This opens the Expression Substitution dialog box. The Expression Substitution dialog box allows you to make the same change to the expressions of all the currently selected columns within a link. For example, if you wanted to add a call to the trim() function around all the string output column expressions in a link, you could do this in two steps. First, use the Select dialog to select all the string output columns. Then use the Expression Substitution dialog to apply a trim() call around each of the existing expression values in those selected columns. You are offered a choice between Whole expression substitution and Part of expression substitution.

Whole Expression
With this option the whole existing expression for each column is replaced by the replacement value specified. This replacement value can be a completely new value, but will usually be a value based on the original expression value. When specifying the replacement value, the existing value of the column's expression can be included in this new value by including "$1". This can be included any number of times.

For example, when adding a trim() call around each expression of the currently selected column set, having selected the required columns, you would:

1  Select the Whole expression option.

2  Enter a replacement value of:


trim($1)

3  Click OK.

Where a column's original expression was:


DSLink3.col1

This will be replaced by:


trim(DSLink3.col1)

This is applied to the expressions in each of the selected columns. If you need to include the actual text $1 in your expression, enter it as "$$1".

Part of Expression
With this option, only part of each selected expression is replaced rather than the whole expression. The part of the expression to be replaced is specified by a Regular Expression match.

It is possible that more than one part of an expression string could match the Regular Expression specified. If Replace all occurrences is checked, then each occurrence of a match will be updated with the replacement value specified. If it is not checked, then just the first occurrence is replaced.

When replacing part of an expression, the replacement value specified can include that part of the original expression being replaced. In order to do this, the Regular Expression specified must have round brackets around its value. "$1" in the replacement value will then represent that matched text. If the Regular Expression is not surrounded by round brackets, then "$1" will simply be the text "$1".

For complex Regular Expression usage, subsets of the Regular Expression text can be included in round brackets rather than the whole text. In this case, the entire matched part of the original expression is still replaced, but "$1", "$2" etc. can be used to refer to each matched bracketed part of the Regular Expression specified.

The following is an example of the Part of expression replacement.

Suppose a selected set of columns have derivations that use input columns from DSLink3. For example, two of these derivations could be:
DSLink3.OrderCount + 1
If (DSLink3.Total > 0) Then DSLink3.Total Else -1

You may want to protect the usage of these input columns from null values, and use a zero value instead of the null. To do this:
1  Select the columns you want to substitute expressions for.

2  Select the Part of expression option.

3  Specify a Regular Expression value of:
(DSLink3\.[a-z,A-Z,0-9]*)

This will match strings that contain DSLink3., followed by any number of alphabetic characters or digits. (This assumes that column names in this case are made up of alphabetic characters and digits.) The round brackets around the whole Expression mean that "$1" will represent the whole matched text in the replacement value.
4  Specify a replacement value of:


NullToZero($1)

This replaces just the matched substrings in the original expression with those same substrings, but surrounded by the NullToZero call.
5  Click OK to apply this to all the selected column derivations.

From the examples above:


DSLink3.OrderCount + 1

would become
NullToZero(DSLink3.OrderCount) + 1

and
If (DSLink3.Total > 0) Then DSLink3.Total Else -1

would become:
If (NullToZero(DSLink3.Total) > 0) Then DSLink3.Total Else -1

If the Replace all occurrences option is selected, the second expression will become:
If (NullToZero(DSLink3.Total) > 0) Then NullToZero(DSLink3.Total) Else -1

The replacement value can be any form of expression string. For example in the case above, the replacement value could have been:
(If (StageVar1 > 50000) Then $1 Else ($1 + 100))

In the first case above, the expression


DSLink3.OrderCount + 1

would become:
(If (StageVar1 > 50000) Then DSLink3.OrderCount Else (DSLink3.OrderCount + 100)) + 1

Specifying Before-Stage and After-Stage Subroutines


You can specify BASIC routines to be executed before or after the stage has processed the data. To specify a routine, click the stage properties button in the toolbar to open the Stage Properties dialog box:

The General tab contains the following fields:

Before-stage subroutine and Input Value. Contain the name (and value) of a subroutine that is executed before the stage starts to process any data.

After-stage subroutine and Input Value. Contain the name (and value) of a subroutine that is executed after the stage has processed the data.

Choose a routine from the drop-down list box. This list box contains all the built routines defined as a Before/After Subroutine under the Routines branch in the Repository. Enter an appropriate value for the routine's input argument in the Input Value field.

If you choose a routine that is defined in the Repository, but which was edited but not compiled, a warning message reminds you to compile the routine when you close the Transformer stage dialog box.

If you installed or imported a job, the Before-stage subroutine or After-stage subroutine field may reference a routine that does not exist on your system. In this case, a warning message appears when you close the dialog box. You must install or import the missing routine or choose an alternative one to use.

A return code of 0 from the routine indicates success; any other code indicates failure and causes a fatal error when the job is run.
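If you write your own Before/After subroutine, the following is a minimal sketch of what one might look like. The routine name and message are hypothetical; the sketch assumes the usual Before/After convention of two arguments, an input argument supplied from the Input Value field and an error code that the routine must set (0 for success):

      SUBROUTINE NotifyOnFinish(InputArg, ErrorCode)
* InputArg holds whatever was entered in the Input Value field.
* Setting ErrorCode to anything other than 0 causes a fatal error at run time.
      ErrorCode = 0
      Call DSLogInfo("Stage processing complete: " : InputArg, "NotifyOnFinish")
      RETURN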

Defining Constraints and Handling Reject Links


You can define limits for output data by specifying a constraint. Constraints are expressions and you can specify a constraint for each output link from a Transformer stage. You can also specify that a particular link is to act as a reject link. Reject links output rows that have not been written on any other output links from the Transformer stage because they have failed constraints or because a write failure has occurred.

To define a constraint or specify an otherwise link, do one of the following:

Select an output link and click the constraints button.
Double-click the output link's constraint entry field.
Choose Constraints from the background or header shortcut menus.

A dialog box appears which allows you either to define constraints for any of the Transformer output links or to define a link as a reject link.

Define a constraint by entering an expression in the Constraint field for that link. Once you have done this, any constraints will appear below the link's title bar in the Transformer Editor. This constraint expression will then be checked against the row data at runtime. If the data does not satisfy the constraint, the row will not be written to that link.

It is also possible to define a link which can be used to catch these rows which have been rejected from a previous link. A reject link can be defined by choosing Yes in the Reject Row field and setting the Constraint field as follows:

To catch rows which are rejected from a specific output link, set the Constraint field to linkname.REJECTED. This will be set whenever a row is rejected on the linkname link, whether because the row fails to match a constraint on that output link, or because a write operation on the target fails for that row. Note that such an otherwise link should occur after the output link from which it is defined to catch rejects.

To catch rows which caused a write failure on an output link, set the Constraint field to linkname.REJECTEDCODE. The value of linkname.REJECTEDCODE will be non-zero if the row was rejected due to a write failure, or 0 (DSE.NOERROR) if the row was rejected due to the link constraint not being met. When editing the Constraint field, you can set return values for linkname.REJECTEDCODE by selecting from the Expression Editor Link Variables > Constants... menu options. These give a range of errors, but note that most write errors return DSE.WRITERROR.

In order to set a reject constraint which differentiates between a write failure and a constraint not being met, a combination of the linkname.REJECTEDCODE and linkname.REJECTED flags can be used. For example:

To catch rows which have failed to be written to an output link, set the Constraint field to linkname.REJECTEDCODE

To catch rows which do not meet a constraint on an output link, set the Constraint field to linkname.REJECTEDCODE = DSE.NOERROR AND linkname.REJECTED

To catch rows which have been rejected due to a constraint or write error, set the Constraint field to linkname.REJECTED

As a "catch all", the Constraint field can be left blank. This indicates that this otherwise link will catch all rows which have not been successfully written to any of the output links processed up to this point. Therefore, the otherwise link should be the last link in the defined processing order. Any other Constraint can be defined. This will result in the number of rows written to that link (i.e. rows which satisfy the constraint) to be recorded in the job log as "rejected rows".
Note Due to the nature of the "catch all" case above, you should only use one reject link whose Constraint field is blank. To use multiple reject links, you should define them to use the linkname.REJECTED flag detailed in the first case above.
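For example, suppose a Transformer has an output link named OrdersOut and a final otherwise link (both names are hypothetical). Following the patterns above, the otherwise link could be set up with Reject Row set to Yes and a constraint of:

OrdersOut.REJECTED

to catch every row that either failed the OrdersOut constraint or failed to be written to OrdersOut. To catch only rows that failed the constraint, and not write failures, the constraint would instead be:

OrdersOut.REJECTEDCODE = DSE.NOERROR AND OrdersOut.REJECTED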

Specifying Link Order


You can specify the order in which output links process a row. The initial order of the links is the order in which they are added to the stage. To reorder the links:
1  Do one of the following:

   Click the output link execution order button on the Transformer Editor toolbar.
   Choose output link reorder from the background shortcut menu.
   Click the stage properties button in the Transformer toolbar or choose stage properties from the background shortcut menu and click on the stage page Link Ordering tab.

   The Link Ordering tab appears:

2  Use the arrow buttons to rearrange the list of links in the execution order required.

3  When you are happy with the order, click OK.

Note Although the link ordering facilities mean that you can use a previous output column to derive a subsequent output column, we do not encourage this practice, and you will receive a warning if you do so.

Defining Local Stage Variables


You can declare and use your own variables within a BASIC Transformer stage. Such variables are accessible only from the BASIC Transformer stage in which they are declared. They can be used as follows:

They can be assigned values by expressions.
They can be used in expressions which define an output column derivation.
Expressions evaluating a variable can include other variables or the variable being evaluated itself.

Any stage variables you declare are shown in a table in the right pane of the links area. The table looks similar to an output link. You can display or hide the table by clicking the Stage Variable button in the Transformer toolbar or choosing Stage Variable from the background shortcut menu.

Note Stage variables are not shown in the output link meta data area at the bottom of the right pane.

The table lists the stage variables together with the expressions used to derive their values. Link lines join the stage variables with input columns used in the expressions. Links from the right side of the table link the variables to the output columns that use them. To declare a stage variable:
1  Do one of the following:

   Click the stage properties button in the Transformer toolbar.
   Choose stage properties from the background shortcut menu.

   The Transformer Stage Properties dialog box appears.

2  Click the Variables tab on the General page. The Variables tab contains a grid showing currently declared variables, their initial values, and an optional description. Use the standard grid controls to add new variables. Variable names must begin with an alphabetic character (a-z, A-Z) and can only contain alphanumeric characters (a-z, A-Z, 0-9). Ensure that the variable does not use the name of any BASIC keywords.

Variables entered in the Stage Properties dialog box appear in the Stage Variable table in the links pane. You perform most of the same operations on a stage variable as you can on an output column (see page 17-10). A shortcut menu offers the same commands. You cannot, however, paste a stage variable as a new column, or a column as a new stage variable.
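As a simple sketch of how a stage variable might be used (the names and logic are hypothetical, and the input is assumed to be sorted by CustomerID), you could declare a variable PrevCustomer with an initial value of "" and define two stage variables in this order:

IsNewCustomer derivation:  DSLink3.CustomerID <> PrevCustomer
PrevCustomer derivation:   DSLink3.CustomerID

Because stage variables retain their values between rows and are evaluated from the top of the table down, IsNewCustomer compares the current row against the customer seen on the previous row before PrevCustomer is updated, and can then be used in output column derivations or constraints to flag the first row for each customer.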

The DataStage Expression Editor


The DataStage Expression Editor helps you to enter correct expressions when you edit BASIC Transformer stages. The Expression Editor can:

Facilitate the entry of expression elements
Complete the names of frequently used variables
Validate variable names and the complete expression

The Expression Editor can be opened from:

Output link Derivation cells
Stage variable Derivation cells
Constraint dialog box
Transform dialog box in the DataStage Manager

Expression Format
The format of an expression is as follows:
KEY: something_like_this is a token
     something_in_italics is a terminal, i.e., doesn't break down any further
     | is a choice between tokens
     [ is an optional part of the construction
     "XXX" is a literal token (i.e., use XXX not including the quotes)
=================================================
expression ::= function_call |
               variable_name |
               other_name |
               constant |
               unary_expression |
               binary_expression |
               if_then_else_expression |
               substring_expression |
               "(" expression ")"
function_call ::= function_name "(" [argument_list] ")"
argument_list ::= expression | expression "," argument_list
function_name ::= name of a built-in function |
                  name of a user-defined_function
variable_name ::= job_parameter name | stage_variable_name | link_variable name
other_name ::= name of a built-in macro, system variable, etc.
constant ::= numeric_constant | string_constant
numeric_constant ::= ["+" | "-"] digits ["." [digits]] ["E" | "e" ["+" | "-"] digits]
string_constant ::= "'" [characters] "'" |
                    """ [characters] """ |
                    "\" [characters] "\"
unary_expression ::= unary_operator expression
unary_operator ::= "+" | "-"
binary_expression ::= expression binary_operator expression
binary_operator ::= arithmetic_operator |
                    concatenation_operator |
                    matches_operator |
                    relational_operator |
                    logical_operator
arithmetic_operator ::= "+" | "-" | "*" | "/" | "^"
concatenation_operator ::= ":"
matches_operator ::= "MATCHES"
relational_operator ::= "=" | "EQ" |
                        "<>" | "#" | "NE" |
                        ">" | "GT" |
                        ">=" | "=>" | "GE" |
                        "<" | "LT" |
                        "<=" | "=<" | "LE"
logical_operator ::= "AND" | "OR"
if_then_else_expression ::= "IF" expression "THEN" expression "ELSE" expression
substring_expression ::= expression "[" [expression ["," expression] "]"

field_expression ::= expression "[" expression "," expression "," expression "]" /* That is, always 3 args

Note keywords like "AND" or "IF" or "EQ" may be in any case
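The following expressions, which use hypothetical link, column, and job parameter names, are all valid according to this format:

DSLink3.Quantity * DSLink3.UnitPrice
Trim(DSLink3.Surname) : ", " : DSLink3.Forename
If DSLink3.Total >= MinTotal Then "OK" Else "LOW"
DSLink3.PostCode[1,4]

The first is a binary_expression using an arithmetic operator, the second concatenates a function_call and string constants with the ":" operator, the third is an if_then_else_expression that compares a column with a job parameter (MinTotal), and the last is a substring_expression that extracts four characters starting at position 1.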

Entering Expressions
Whenever the insertion point is in an expression box, you can use the Expression Editor to suggest the next element in your expression. Do this by right-clicking the box, or by clicking the Suggest button to the right of the box. This opens the Suggest Operand or Suggest Operator menu. Which menu appears depends on context, i.e., whether you should be entering an operand or an operator as the next expression element. You will be offered a different selection on the Suggest Operand menu depending on whether you are defining key expressions, derivations and constraints, or a custom transform. The Suggest Operator menu is always the same. Suggest Operand Menu - Transformer Stage:

Suggest Operand Menu - Defining Custom Transforms:

17-24

Parallel Job Developers Guide

BASIC Transformer Stages

The DataStage Expression Editor

Suggest Operator Menu:

Completing Variable Names


The Expression Editor stores variable names. When you enter a variable name you have used before, you can type the first few characters, then press F5. The Expression Editor completes the variable name for you. If you enter the name of an input link followed by a period, for example, DailySales., the Expression Editor displays a list of the column names of that link. If you continue typing, the list selection changes to match what you type. You can also select a column name using the mouse. Enter a selected column name into the expression by pressing Tab or Enter. Press Esc to dismiss the list without selecting a column name.

Validating the Expression


When you have entered an expression in the Transformer Editor, press Enter to validate it. The Expression Editor checks that the syntax is correct and that any variable names used are acceptable to the compiler. When using the Expression Editor to define a custom transform, click OK to validate the expression. If there is an error, a message appears and the element causing the error is highlighted in the expression box. You can either correct the expression or close the Transformer Editor or Transform dialog box. Within the Transformer Editor, the invalid expressions are shown in red. (In some cases this may simply mean that the expression does not meet the strict usage pattern rules of the DataStage engine, but will actually function correctly.)

Exiting the Expression Editor


You can exit the Expression Editor in the following ways:

Press Esc (which discards changes).
Press Return (which accepts changes).
Click outside the Expression Editor box (which accepts changes).

Configuring the Expression Editor


You can resize the Expression Editor window by dragging. The next time you open the expression editor in the same context (for example, editing output columns) on the same client, it will have the same size. The Expression Editor is configured by editing the Designer options. This allows you to specify how helpful the expression editor is. For more information, see "Specifying Designer Options" in DataStage Designer Guide.

BASIC Transformer Stage Properties


The Transformer stage has a Properties dialog box which allows you to specify details about how the stage operates. The Transformer Stage dialog box has three pages:

Stage page. This is used to specify general information about the stage.

Inputs page. This is where you specify details about the data input to the Transformer stage.

Outputs page. This is where you specify details about the output links from the Transformer stage.

Stage Page
The Stage page has four tabs:

General. Allows you to enter an optional description of the stage and specify a before-stage and/or after-stage subroutine.

Variables. Allows you to set up stage variables for use in the stage.

Link Ordering. Allows you to specify the order in which the output links will be processed.

Advanced. Allows you to specify how the stage executes.

The General tab is described in "Before-Stage and After-Stage Routines" on page 17-6. The Variables tab is described in "Defining Local Stage Variables" on page 17-20. The Link Ordering tab is described in "Specifying Link Order" on page 17-19.

Advanced Tab
The Advanced tab is the same as the Advanced tab of the generic stage editor as described in "Advanced Tab" on page 3-12. This tab allows you to specify the following:

Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In sequential mode the data is processed by the conductor node.

Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.

Preserve partitioning. This is set to Propagate by default; this sets or clears the partitioning in accordance with what the previous stage has set. You can also select Set or Clear. If you select Set, the stage will request that the next stage preserves the partitioning as is.

Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about data coming into the Transformer stage. The Transformer stage can have only one input link. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned. This is the same as the Partitioning tab in the generic stage editor described in "Partitioning Tab" on page 3-20.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected when input to the BASIC Transformer stage. It also allows you to specify that the data should be sorted on input.

By default the BASIC Transformer stage will attempt to preserve partitioning of incoming data, or use its own partitioning method according to what the previous stage in the job dictates. If the BASIC Transformer stage is operating in sequential mode, it will first collect the data using the default collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

Whether the stage is set to execute in parallel or sequential mode.
Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the BASIC Transformer stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning type drop-down list. This will override any current partitioning. If the BASIC Transformer stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method.

The following partitioning methods are available:

(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Transformer stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

(Auto). This is the default method for the Transformer stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted. The sort is always carried out within data partitions. If the stage is partitioning incoming data, the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen. Select the check boxes as follows:

Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.

You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs Page has a General tab which allows you to enter an optional description for each of the output links on the BASIC Transformer stage. The Advanced tab allows you to change the default buffering settings for the output links.


18
Aggregator Stage
The Aggregator stage is a processing stage. It classifies data rows from a single input link into groups and computes totals or other aggregate functions for each group. The summed totals for each group are output from the stage via an output link.

When you edit an Aggregator stage, the Aggregator stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors." The stage editor has three pages:

Stage Page. This is always present and is used to specify general information about the stage.

Inputs Page. This is where you specify details about the data being grouped and/or aggregated.

Outputs Page. This is where you specify details about the groups being output from the stage.

The aggregator stage gives you access to grouping and summary operations. One of the easiest ways to expose patterns in a collection of records is to group records with similar characteristics, then compute statistics on all records in the group. You can then use these statistics to compare properties of the different groups. For example, records containing cash register transactions might be grouped by the day of the week to see which day had the largest number of transactions, the largest amount of revenue, etc.

Records can be grouped by one or more characteristics, where record characteristics correspond to column values. In other words, a group is a set of records with the same value for one or more columns. For example, transaction records might be grouped by both day of the week and by month. These groupings might show that the busiest day of the week varies by season.

In addition to revealing patterns in your data, grouping can also reduce the volume of data by summarizing the records in each group, making it easier to manage. If you group a large volume of data on the basis of one or more characteristics of the data, the resulting data set is generally much smaller than the original and is therefore easier to analyze using standard workstation or PC-based tools.

At a practical level, you should be aware that, in a parallel environment, the way that you partition data before grouping and summarizing it can affect the results. For example, if you partitioned using the round robin method, records with identical values in the column you are grouping on would end up in different partitions. If you then performed a sum operation within these partitions you would not be operating on all the relevant columns. In such circumstances you may want to key partition the data on one or more of the grouping keys to ensure that your groups are entire.

It is important that you bear these facts in mind and take any steps you need to prepare your data set before presenting it to the Aggregator stage. In practice this could mean you use Sort stages or additional Aggregate stages in the job.
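As a small, purely illustrative example of why partitioning matters: suppose four input rows with License = BUN and charges of 100, 200, 300, and 400 are round robin partitioned across two nodes. Each partition sees only two of the rows, so a per-partition sum produces two partial results (100 + 300 = 400 and 200 + 400 = 600) rather than the single group total of 1000. Hash partitioning on License instead sends all four rows to the same partition, so the sum for the BUN group is computed over the complete group.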

Example
The example data is from a freight carrier who charges customers based on distance, equipment, packing, and license requirements. They need a report of distance traveled and charges grouped by date and license type. The following table shows a sample of the data:
Ship Date     District   Distance   Equipment   Packing   License   Charge
...
2000-06-02    1          1540       D           M         BUN       1300
2000-07-12    1          1320       D           C         SUM       4800
2000-08-02    1          1760       D           C         CUM       1300
2000-06-22    2          1540       D           C         CUN       13500
2000-07-30    2          1320       D           M         SUM       6000
...

The stage will output the following columns:

The stage first hash partitions the incoming data on the license column, then sorts it on license and date:


The properties are then used to specify the grouping and the aggregating of the data:

The following is a sample of the output data:


Ship Date | License | Distance Sum | Distance Mean | Charge Sum | Charge Mean
...
2000-06-02 | BUN | 1126053.00 | 1563.93 | 20427400.00 | 28371.39
2000-06-12 | BUN | 2031526.00 | 2074.08 | 22426324.00 | 29843.55
2000-06-22 | BUN | 1997321.00 | 1958.45 | 19556450.00 | 19813.26
2000-06-30 | BUN | 1815733.00 | 1735.77 | 17023668.00 | 18453.02
...

If you wanted to go on and work out the sum of the distance and charge sums by license, you could insert another Aggregator stage with the following properties:
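
As a rough illustration of the two aggregation passes described in this example, the following Python sketch (not DataStage code; the sample rows and values are hypothetical) groups by ship date and license to compute sums and means, then regroups the summaries by license alone and sums the sums, as the second Aggregator stage would.

```python
# Illustrative sketch only: first-pass grouping, then re-aggregation of the summaries.
from collections import defaultdict

rows = [  # (ship_date, license, distance, charge) -- made-up sample rows
    ("2000-06-02", "BUN", 1540, 1300),
    ("2000-06-02", "BUN", 1320, 4800),
    ("2000-06-12", "BUN", 1760, 1300),
]

groups = defaultdict(list)
for date, lic, dist, charge in rows:
    groups[(date, lic)].append((dist, charge))

first_pass = {}  # equivalent of the first Aggregator stage (group by date and license)
for (date, lic), vals in groups.items():
    dist_sum = sum(d for d, _ in vals)
    charge_sum = sum(c for _, c in vals)
    first_pass[(date, lic)] = {
        "distance_sum": dist_sum,
        "distance_mean": dist_sum / len(vals),
        "charge_sum": charge_sum,
        "charge_mean": charge_sum / len(vals),
    }

# Second Aggregator stage: regroup the summaries by license and sum the sums.
by_license = defaultdict(lambda: {"distance_sum": 0, "charge_sum": 0})
for (date, lic), summary in first_pass.items():
    by_license[lic]["distance_sum"] += summary["distance_sum"]
    by_license[lic]["charge_sum"] += summary["charge_sum"]

print(first_pass)
print(dict(by_license))
```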


Must Dos
DataStage has many defaults, which means that it can be very easy to include Aggregator stages in a job. This section specifies the minimum steps to take to get an Aggregator stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are as you become familiar with the product. To use an Aggregator stage: In the Stage Page Properties Tab, under the Grouping Keys category:

Specify the key column that the data will be grouped on. You can repeat the key property to specify composite keys.

Under the Aggregations category:

Choose an aggregation type. Calculation is the default, and allows you to summarize a column or columns. Count rows allows you to count the number of rows within each group. Recalculation allows you to apply aggregate functions to a column that has already been summarized.

Other properties depend on the aggregate type chosen:

If you have chosen the Calculation aggregation type, specify the column to be summarized in Column for Calculation. You can repeat this property to specify multiple columns. Choose one or more dependent properties to specify the type of aggregation to perform, and the name of the output column that will hold the result. If you have chosen the Count Rows aggregation type, specify the output column that will hold the count. If you have chosen the Re-calculation aggregation type, specify the column to be re-calculated. You can repeat this property to specify multiple columns. Choose one or more dependent properties to specify the type of aggregation to perform, and the name of the output column that will hold the result.

In the Output Page Mapping Tab, check that the mapping is as you expect (DataStage maps data onto the output columns according to what you specify in the Properties Tab).


Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Grouping Keys/Group | Input column | N/A | Y | Y | N/A
Grouping Keys/Case Sensitive | True/False | True | N | N | Group
Aggregations/Aggregation Type | Calculation/Recalculation/Count rows | Calculation | Y | N | N/A
Aggregations/Column for Calculation | Input column | N/A | Y (if Aggregation Type = Calculation) | Y | N/A
Aggregations/Count Output Column | Output column | N/A | Y (if Aggregation Type = Count Rows) | N | N/A
Aggregations/Summary Column for Recalculation | Input column | N/A | Y (if Aggregation Type = Recalculation) | Y | N/A
Aggregations/Default To Decimal Output | precision, scale | 8,2 | N | N | N/A
Aggregations/Corrected Sum of Squares | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Maximum Value | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Mean Value | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Minimum Value | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Missing Value | Output column | N/A | N | Y | Column for Calculation
Aggregations/Missing Values Count | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Non-missing Values Count | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Percent Coefficient of Variation | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Range | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Standard Deviation | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Standard Error | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Sum of Weights | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Sum | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Summary | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Uncorrected Sum of Squares | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Variance | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Variance divisor | Default/Nrecs | Default | N | N | Variance
Aggregations/Weighting column | Input column | N/A | N | N | Column for Calculation or Count Output Column
Aggregations/Decimal Output | precision, scale | 8,2 | N | N | Calculation or Recalculation method
Options/Method | hash/sort | hash | Y | N | N/A
Options/Allow Null Outputs | True/False | False | Y | N | N/A

Grouping Keys Category


Group
Specifies the input columns you are using as group keys. Repeat the property to select multiple columns as group keys. You can use the Column Selection dialog box to select several group keys at once if required (see page 3-10). This property has a dependent property:

Case Sensitive
Use this to specify whether each group key is case sensitive or not. It is set to True by default; i.e., the values CASE and case would end up in different groups.

Aggregations Category
Aggregation Type
This property allows you to specify the type of aggregation operation your stage is performing. Choose from Calculate (the default), Recalculate, and Count Rows.

Column for Calculation
The Calculate aggregate type allows you to summarize the contents of a particular column or columns in your input data set by applying one or more aggregate functions to it. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it, and the output column to carry the result. You can use the Column Selection dialog box to select several columns for calculation at once if required (see page 3-10).

Count Output Column
The Count Rows aggregate type performs a count of the number of records within each group. Specify the column on which the count is output.


Summary Column for Recalculation
This aggregate type allows you to apply aggregate functions to a column that has already been summarized. This is like calculate but performs the specified aggregate operation on a set of data that has already been summarized. In practice this means you should have performed a calculate (or recalculate) operation in a previous Aggregator stage with the Summary property set to produce a subrecord containing the summary data that is then included with the data set. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it, and the output column to carry the result. You can use the Column Selection dialog box to select several columns for recalculation at once if required (see page 3-10).

Weighting column
Configures the stage to increment the count for the group by the contents of the weight column for each record in the group, instead of by 1. Not available for Summary Column for Recalculation. Setting this option affects only the following options: Percent Coefficient of Variation, Mean Value, Sum, Sum of Weights, and Uncorrected Sum of Squares.

Default To Decimal Output
The output type of a calculation or recalculation column is double. Setting this property causes it to default to decimal. You can also set a default precision and scale. (You can also specify that individual columns have decimal output while others retain the default type of double.)

Options Category
Method
The aggregate stage has two modes of operation: hash and sort. Your choice of mode depends primarily on the number of groupings in the input data set, taking into account the amount of memory available. You typically use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory to be used.


When using hash mode, you should hash partition the input data set by one or more of the grouping key columns so that all the records in the same group are in the same partition (this happens automatically if auto is set in the Partitioning tab). However, hash partitioning is not mandatory; you can use any partitioning method you choose if keeping groups together in a single partition is not important. For example, if you're summing records in each partition and later you'll add the sums across all partitions, you don't need all records in a group to be in the same partition to do this. Note, though, that there will be multiple output records for each group.

If the number of groups is large, which can happen if you specify many grouping keys, or if some grouping keys can take on many values, you would normally use sort mode. However, sort mode requires the input data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys (this happens automatically if auto is set in the Partitioning tab). Sorting requires a pregrouping operation: after sorting, all records in a given group in the same partition are consecutive.

The method property is set to hash by default. You may want to try both modes with your particular data and application to determine which gives the better performance. You may find that when calculating statistics on large numbers of groups, sort mode performs better than hash mode, assuming the input data set can be efficiently sorted before it is passed to group.

Allow Null Outputs
Set this to True to indicate that null is a valid output value when calculating minimum value, maximum value, mean value, standard deviation, standard error, sum, sum of weights, and variance. If False, the null value will have 0 substituted when all input values for the calculation column are null. It is False by default.
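
The following Python sketch (not DataStage code; the sample rows are hypothetical) contrasts the two modes described under Method above: hash mode keeps an accumulator per group and is insensitive to input order, while sort mode relies on the input being sorted so that each group arrives as a consecutive run.

```python
# Illustrative sketch only: hash-style versus sort-style aggregation.
from itertools import groupby

rows = [("BUN", 10), ("SUM", 5), ("BUN", 7), ("CUM", 2)]  # (group key, value)

# Hash mode: one in-memory accumulator per distinct group; memory grows with group count.
hash_totals = {}
for key, val in rows:
    hash_totals[key] = hash_totals.get(key, 0) + val

# Sort mode: sort (or receive pre-sorted) input so each group is consecutive,
# then stream through it holding only the current group's accumulator.
sort_totals = {}
for key, grp in groupby(sorted(rows), key=lambda r: r[0]):
    sort_totals[key] = sum(val for _, val in grp)

assert hash_totals == sort_totals
```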

Calculation and Recalculation Dependent Properties


The following properties are dependents of both Column for Calculation and Summary Column for Recalculation. These specify the various aggregate functions and the output columns to carry the results.

Corrected Sum of Squares
Produces a corrected sum of squares for data in the aggregate column and outputs it to the specified output column.


Maximum Value
Gives the maximum value in the aggregate column and outputs it to the specified output column.

Mean Value
Gives the mean value in the aggregate column and outputs it to the specified output column.

Minimum Value
Gives the minimum value in the aggregate column and outputs it to the specified output column.

Missing Value
This specifies what constitutes a missing value, for example -1 or NULL. Enter the value as a floating point number. Not available for Summary Column for Recalculation.

Missing Values Count
Counts the number of aggregate columns with missing values in them and outputs the count to the specified output column. Not available for recalculate.

Non-missing Values Count
Counts the number of aggregate columns with values in them and outputs the count to the specified output column.

Percent Coefficient of Variation
Calculates the percent coefficient of variation for the aggregate column and outputs it to the specified output column.

Range
Calculates the range of values in the aggregate column and outputs it to the specified output column.

Standard Deviation
Calculates the standard deviation of values in the aggregate column and outputs it to the specified output column.

Standard Error
Calculates the standard error of values in the aggregate column and outputs it to the specified output column.


Sum of Weights
Calculates the sum of values in the weight column specified by the Weight column property and outputs it to the specified output column.

Sum
Sums the values in the aggregate column and outputs the sum to the specified output column.

Summary
Specifies a subrecord to write the results of the calculate or recalculate operation to.

Uncorrected Sum of Squares
Produces an uncorrected sum of squares for data in the aggregate column and outputs it to the specified output column.

Variance
Calculates the variance for the aggregate column and outputs it to the specified output column. This has a dependent property:

Variance divisor
Specifies the variance divisor. By default, the stage uses a value of the number of records in the group minus the number of records with missing values minus 1 to calculate the variance. This corresponds to a vardiv setting of Default. If you specify NRecs, DataStage uses the number of records in the group minus the number of records with missing values instead.

Each of these properties has a dependent property as follows:

Decimal Output. By default all calculation or recalculation columns have an output type of double. This property allows you to specify that the column has an output type of decimal. You can also specify a precision and scale for the type (by default 8,2).
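
As a rough illustration of these functions, the following Python sketch (not DataStage code; the sample values are hypothetical and missing values are ignored) computes the corrected and uncorrected sums of squares and shows how the two variance divisor settings described above differ.

```python
# Illustrative sketch only: sums of squares and the two variance divisor settings.
values = [2.0, 4.0, 4.0, 5.0]

n = len(values)
mean = sum(values) / n
uncorrected_ss = sum(v * v for v in values)          # Uncorrected Sum of Squares
corrected_ss = sum((v - mean) ** 2 for v in values)  # Corrected Sum of Squares

variance_default = corrected_ss / (n - 1)  # vardiv = Default: records - missing - 1
variance_nrecs = corrected_ss / n          # vardiv = NRecs: records - missing

print(uncorrected_ss, corrected_ss, variance_default, variance_nrecs)
```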

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data set is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Set by default. You can select Set or Clear. If you select Set the stage will request that the next stage in the job attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Aggregator stage uses this when it is grouping by key to determine the order of the key fields. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the incoming data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being grouped and/or summarized. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Aggregator stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is grouped and/or summarized. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of


current and preceding stages and how many nodes are specified in the Configuration file. If the Aggregator stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Aggregator stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Aggregator stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Aggregator stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Aggregator stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button .


Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Aggregator stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the file or files. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto modes). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
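
The following Python sketch (not DataStage code; the rows and partition count are hypothetical) illustrates how three of the partitioning methods listed above assign rows to partitions.

```python
# Illustrative sketch only: hash, modulus, and round robin partition assignment.
rows = [{"license": "BUN", "tag": 3}, {"license": "SUM", "tag": 7}, {"license": "CUM", "tag": 4}]
n_partitions = 2

def hash_partition(row):
    # Hash: rows with the same key value always land in the same partition.
    return hash(row["license"]) % n_partitions

def modulus_partition(row):
    # Modulus: a numeric key (often a tag field) modulo the number of partitions.
    return row["tag"] % n_partitions

def round_robin_partition(index):
    # Round robin: rows are dealt out in turn, regardless of their values.
    return index % n_partitions

for i, row in enumerate(rows):
    print(row, hash_partition(row), modulus_partition(row), round_robin_partition(i))
```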


Outputs Page
The Outputs page allows you to specify details about data output from the Aggregator stage. The Aggregator stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship between the processed data being produced by the Aggregator stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about Aggregator stage mapping is given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For the Aggregator stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns and/or the generated columns. These are read only and cannot be modified on this tab. The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging columns over from the left pane, or by using the Auto-match facility. In the above example the left pane represents the data after it has been grouped and summarized. The Expression field shows how the column has been derived. The right pane represents the data being


output by the stage after the grouping and summarizing. In this example ocol1 carries the value of the key field on which the data was grouped (for example, if you were grouping by date it would contain each date grouped on). Column ocol2 carries the mean of all the col2 values in the group, ocol4 the minimum value, and ocol3 the sum.


19
Join Stage
The Join stage is a processing stage. It performs join operations on two or more data sets input to the stage and then outputs the resulting data set. The Join stage is one of three stages that join tables based on the values of key columns. The other two are:

Lookup stage (Chapter 21)

Merge stage (Chapter 20)

The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and their requirements for data being input (for example, whether it is sorted). See "Join Versus Lookup" on page 19-2 for help in deciding which stage to use. In the Join stage, the input data sets are notionally identified as the right set and the left set, and intermediate sets. You can specify which is which. It has any number of input links and a single output link.

The stage can perform one of four join operations:


Inner transfers records from input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped.

Left outer transfers all values from the left data set but transfers values from the right data set and intermediate data sets only where key columns match. The stage drops the key column from the right and intermediate data sets.

Right outer transfers all values from the right data set and transfers values from the left data set and intermediate data sets only where key columns match. The stage drops the key column from the left and intermediate data sets.

Full outer transfers records in which the contents of the key columns are equal from the left and right input data sets to the output data set. It also transfers records whose key columns contain unequal values from both input data sets to the output data set. (Full outer joins do not support more than two input links.)

The data sets input to the Join stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing the auto partitioning method will ensure that partitioning and sorting is done. If sorting and partitioning are carried out on separate stages before the Join stage, DataStage in auto mode will detect this and not repartition (alternatively you could explicitly specify the Same partitioning method).

The Join stage editor has three pages:

Stage Page. This is always present and is used to specify general information about the stage.

Inputs Page. This is where you specify details about the data sets being joined.

Outputs Page. This is where you specify details about the joined data being output from the stage.

Join Versus Lookup


DataStage doesn't know how large your data is, so cannot make an informed choice whether to combine data using a join stage or a lookup stage. Here's how to decide which to use: There are two data sets being combined. One is the primary or driving dataset, sometimes called the left of the join. The other data set(s) are the reference datasets, or the right of the join.


In all cases we are concerned with the size of the reference datasets. If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage may thrash because the reference datasets may not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation. So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over the join processing is very fast and never involves paging or other I/O.

Example Joins
The following examples show what happens to two data sets when each type of join operation is applied to them. Here are the two data sets:
Left Input Data Set

Status | Price
sold | 125
sold | 213
offered | 378
Pending | 575
Pending | 649
Offered | 777
Offered | 908
Pending | 908

Right Input Data Set

Price | ID
113 | NI6325
125 | BR9658
285 | CZ2538
628 | RU5713
668 | SA5680
777 | JA1081
908 | DE1911
908 | FR2081

Price is the key column which is going to be joined on, and bold type indicates where the data sets share the same value for Price. The data sets are already sorted on that key.


Inner Join
Here is the data set that is output if you perform an inner join on the Price key column:
Output Data Set

Status | Price | ID
sold | 125 | BR9658
Offered | 777 | JA1081
Offered | 908 | DE1911
Offered | 908 | FR2081
Pending | 908 | DE1911
Pending | 908 | FR2081

Left Outer Join


Here is the data set that is output if you perform a left outer join on the Price key column:
Output Data Set

Status | Price | ID
sold | 125 | BR9658
sold | 213 |
offered | 378 |
Pending | 575 |
Pending | 649 |
Offered | 777 | JA1081
Offered | 908 | DE1911
Offered | 908 | FR2081
Pending | 908 | DE1911
Pending | 908 | FR2081


Right Outer Join


Here is the data set that is output if you perform a right outer join on the Price key column:
Output Data Set

Status | Price | ID
 | 113 | NI6325
sold | 125 | BR9658
 | 285 | CZ2538
 | 628 | RU5713
 | 668 | SA5680
Offered | 777 | JA1081
Offered | 908 | DE1911
Offered | 908 | FR2081
Pending | 908 | DE1911
Pending | 908 | FR2081

Full Outer Join


Here is the data set that is output if you perform a full outer join on the Price key column:
Output Data Set

Status | Price (left) | Price (right) | ID
 |  | 113 | NI6325
sold | 125 | 125 | BR9658
sold | 213 |  |
 |  | 285 | CZ2538
offered | 378 |  |
Pending | 575 |  |
 |  | 628 | RU5713
Pending | 649 |  |
 |  | 668 | SA5680
Offered | 777 | 777 | JA1081
Offered | 908 | 908 | DE1911
Offered | 908 | 908 | FR2081
Pending | 908 | 908 | DE1911
Pending | 908 | 908 | FR2081
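
For readers who find it easier to follow in code, the following Python sketch (not DataStage code) reproduces the four join results above, using pandas purely to illustrate the semantics. Note that pandas keeps a single Price column, whereas the Join stage's full outer join output shown above retains a Price column from each input.

```python
# Illustrative sketch only: the four join types applied to the example data sets.
import pandas as pd

left = pd.DataFrame({
    "Status": ["sold", "sold", "offered", "Pending", "Pending", "Offered", "Offered", "Pending"],
    "Price": [125, 213, 378, 575, 649, 777, 908, 908],
})
right = pd.DataFrame({
    "Price": [113, 125, 285, 628, 668, 777, 908, 908],
    "ID": ["NI6325", "BR9658", "CZ2538", "RU5713", "SA5680", "JA1081", "DE1911", "FR2081"],
})

inner = pd.merge(left, right, on="Price", how="inner")        # matching keys only
left_outer = pd.merge(left, right, on="Price", how="left")    # all left rows
right_outer = pd.merge(left, right, on="Price", how="right")  # all right rows
full_outer = pd.merge(left, right, on="Price", how="outer")   # all rows from both sides

print(inner)
```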

Must Dos
DataStage has many defaults, which means that Joins can be simple to set up. This section specifies the minimum steps to take to get a Join stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are as you become familiar with the product.

In the Stage Page Properties Tab specify the key column or columns that the join will be performed on.

In the Stage Page Properties Tab specify the join type or accept the default of Inner.

In the Stage Page Link Ordering Tab, check that your links are correctly identified as left, right, and intermediate, and reorder if required.

Ensure required column meta data has been specified (this may be done in another stage).

In the Outputs Page Mapping Tab, specify how the columns from the input links map onto output columns.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which of the input links is the right link and which is the left link and which are intermediate. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.


Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Join Keys/Key | Input Column | N/A | Y | Y | N/A
Join Keys/Case Sensitive | True/False | True | N | N | Key
Options/Join Type | Full Outer/Inner/Left Outer/Right Outer | Inner | Y | N | N/A

Join Keys Category


Key
Choose the input column you want to join on. You are offered a choice of input columns common to all links. For a join to work you must join on a column that appears in all input data sets, i.e. have the same name and compatible data types. If, for example, you select a column called name from the left link, the stage will expect there to be an equivalent column called name on the right link. You can join on multiple key columns. To do so, repeat the Key property. You can use the Column Selection dialog box to select several key columns at once if required (see page 3-10). Key has a dependent property:

Case Sensitive
Use this to specify whether each key is case sensitive or not. It is set to True by default; i.e., the values CASE and case would not be judged equivalent.


Options Category
Join Type
Specify the type of join operation you want to perform. Choose one of:

Full Outer
Inner
Left Outer
Right Outer

The default is Inner.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts the setting which results from ORing the settings of the input stages, i.e., if either of the input stages uses Set then this stage will use Set. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempts to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering Tab


This tab allows you to specify which input link is regarded as the left link and which link is regarded as the right link, and which links are regarded as intermediate. By default the first link you add is regarded as the left link, and the last one as the right link, with all other links labelled as Intermediate N. You can use this tab to override the default order.

In the example, DSLink4 is the left link. Click on it to select it, then click the down arrow to convert it into the right link.

NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Join stage uses this when it is determining the order of the key fields. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. Choose an input link from the Input name drop down list to specify which link you want to work on. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being joined. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Join stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning on Input Links


The Partitioning tab allows you to specify details about how the data on each of the incoming links is partitioned or collected before it is joined. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of


current and preceding stages and how many nodes are specified in the Configuration file. Auto mode ensures that data being input to the Join stage is key partitioned and sorted. If the Join stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Join stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Join stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Join stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default collection method for the Join stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button .


Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for the Join stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. In the case of a Join stage, Auto will also ensure that the collected data is sorted. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to explicitly specify that data arriving on the input link should be sorted before being joined (you might use this if you have selected a partitioning method other than auto or same). The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you


can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Join stage. The Join stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Join stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about Join stage mapping is given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For Join stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them.

The left pane shows the input columns from the links whose tables have been joined. These are read only and cannot be modified on this tab.


The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. In the above example the left pane represents the data after it has been joined. The Expression field shows how the column has been derived, and the Column Name shows the column after it has been joined. The right pane represents the data being output by the stage after the join. In this example the data has been mapped straight across.


20
Merge Stage
The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links. The Merge stage is one of three stages that join tables based on the values of key columns. The other two are:

Join stage (Chapter 19)

Lookup stage (Chapter 21)

The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and their requirements for data being input (for example, whether it is sorted).

The Merge stage combines a master data set with one or more update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that you specify. Merge key columns are one or more columns that exist in both the master and update records.


The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing the auto partitioning method will ensure that partitioning and sorting is done. If sorting and partitioning are carried out on separate stages before the Merge stage, DataStage in auto partition mode will detect this and not repartition (alternatively you could explicitly specify the Same partitioning method). As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well. See Chapter 24 for information about the Remove Duplicates stage. Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific for that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link. The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify details about the data sets being merged.

Outputs Page. This is where you specify details about the merged data being output from the stage and about the reject links.

Example Merge
This example shows what happens to a master data set and two update data sets when they are merged. The key field is Horse, and all the data sets are sorted in descending order. Here is the master data set:
Horse | Freezemark | Mchip | Reg_Soc | Level
William | DAM7 | N/A | FPS | Adv
Robin | DG36 | N/A | FPS | Nov
Kayser | N/A | N/A | AHS | N/A
Heathcliff | A1B1 | N/A | N/A | Adv
Fairfax | N/A | N/A | FPS | N/A
Chaz | N/A | a296100da | AHS | Inter

Here is the Update 1 data set:


Horse | vacc. | last_worm
William | 07.07.02 | 12.10.02
Robin | 07.07.02 | 12.10.02
Kayser | 11.12.02 | 12.10.02
Heathcliff | 07.07.02 | 12.10.02
Fairfax | 11.12.02 | 12.10.02
Chaz | 10.02.02 | 12.10.02


Here is the Update 2 data set:


Horse | last_trim | shoes
William | 11.05.02 | N/A
Robin | 12.03.02 | refit
Kayser | 11.05.02 | N/A
Heathcliff | 12.03.02 | new
Fairfax | 12.03.02 | N/A
Chaz | 12.03.02 | new

Here is the merged data set output by the stage:


Horse | Freezemark | Mchip | Reg_Soc | Level | vacc. | last_worm | last_trim | shoes
William | DAM7 | N/A | FPS | Adv | 07.07.02 | 12.10.02 | 11.05.02 | N/A
Robin | DG36 | N/A | FPS | Nov | 07.07.02 | 12.10.02 | 12.03.02 | refit
Kayser | N/A | N/A | AHS | N/A | 11.12.02 | 12.10.02 | 11.05.02 | N/A
Heathcliff | A1B1 | N/A | N/A | Adv | 07.07.02 | 12.10.02 | 12.03.02 | new
Fairfax | N/A | N/A | FPS | N/A | 11.12.02 | 12.10.02 | 12.03.02 | N/A
Chaz | N/A | a296100da | AHS | Inter | 10.02.02 | 12.10.02 | 12.03.02 | new
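
As a rough illustration of the merge semantics, the following Python sketch (not DataStage code; the "Dorothy" update row is hypothetical and deliberately has no matching master row) combines a master row with its matching update row on the Horse key, keeps or drops unmatched masters according to the Unmatched Masters Mode, and routes unmatched update rows to that update link's reject output.

```python
# Illustrative sketch only: merging a master row with an update row on the Horse key.
master = {"William": {"Freezemark": "DAM7", "Level": "Adv"},
          "Robin": {"Freezemark": "DG36", "Level": "Nov"}}
update = {"William": {"vacc.": "07.07.02", "last_worm": "12.10.02"},
          "Dorothy": {"vacc.": "11.12.02", "last_worm": "12.10.02"}}  # no master row

unmatched_masters_mode = "Keep"  # "Keep" outputs unmatched master rows, "Drop" discards them
merged, rejects = [], []

for horse, master_cols in master.items():
    update_cols = update.get(horse)
    if update_cols is not None:
        merged.append({"Horse": horse, **master_cols, **update_cols})
    elif unmatched_masters_mode == "Keep":
        merged.append({"Horse": horse, **master_cols})

# Update rows whose key matches no master row go to the reject output for that update link.
for horse, update_cols in update.items():
    if horse not in master:
        rejects.append({"Horse": horse, **update_cols})

print(merged)
print(rejects)
```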

Must Dos
DataStage has many defaults, which means that Merges can be simple to set up. This section specifies the minimum steps to take to get a Merge stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are as you become familiar with the product.

In the Stage Page Properties Tab specify the key column or columns that the Merge will be performed on.

In the Stage Page Properties Tab set the Unmatched Masters Mode, Warn on Reject Updates, and Warn on Unmatched Masters options or accept the defaults.

In the Stage Page Link Ordering Tab, check that your input links are correctly identified as master and update(s), and your output links are correctly identified as master and update reject. Reorder if required.


Ensure required column meta data has been specified (this may be done in another stage, or may be omitted altogether if you are relying on Runtime Column Propagation). In the Outputs Page Mapping Tab, specify how the columns from the input links map onto output columns.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties that determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Merge Keys/Key | Input Column | N/A | Y | Y | N/A
Merge Keys/Sort Order | Ascending/Descending | Ascending | Y | N | Key
Merge Keys/Nulls position | First/Last | First | N | N | Key
Merge Keys/Sort as EBCDIC | True/False | False | N | N | Key
Merge Keys/Case Sensitive | True/False | True | N | N | Key
Options/Unmatched Masters Mode | Keep/Drop | Keep | Y | N | N/A
Options/Warn On Reject Masters | True/False | True | Y | N | N/A
Options/Warn On Reject Updates | True/False | True | Y | N | N/A

Merge Keys Category


Key
This specifies the key column you are merging on. Repeat the property to specify multiple keys. You can use the Column Selection dialog box to select several keys at once if required (see page 3-10). Key has the following dependent properties:

Sort Order
Choose Ascending or Descending. The default is Ascending.

Nulls position
By default columns containing null values appear first in the merged data set. To override this default so that columns containing null values appear last in the merged data set, select Last.

Sort as EBCDIC
To sort as in the EBCDIC character set, choose True.

Case Sensitive
Use this to specify whether each merge key is case sensitive or not. It is set to True by default; i.e., the values CASE and case would not be judged equivalent.

Options Category
Unmatched Masters Mode
Set to Keep by default. It specifies that unmatched rows from the master link are output to the merged data set. Set to Drop to specify that rejected records are dropped instead.

Warn On Reject Masters
Set to True by default. This will warn you when bad records from the master link are rejected. Set it to False to receive no warnings.


Warn On Reject Updates
Set to True by default. This will warn you when bad records from any update links are rejected. Set it to False to receive no warnings.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts the setting which results from ORing the settings of the input stages, i.e., if any of the input stages uses Set then this stage will use Set. You can explicitly select Set or Clear. Select Set to request the next stage in the job attempts to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Link Ordering Tab


This tab allows you to specify which of the input links is the master link and the order in which links input to the Merge stage are processed. You can also specify which of the output links is the master


link, and which of the reject links corresponds to which of the incoming update links.

By default the links will be processed in the order they were added. To rearrange them, choose an input link and click the up arrow button or the down arrow button.

NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Merge stage uses this when it is determining the order of the key fields. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the data coming in to be merged. Choose an input link from the Input name drop down list to specify which link you want to work on. The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Merge stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the merge is performed. By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will warn you if it cannot preserve the partitioning of the incoming data. Auto mode ensures that data being input to the Merge stage is key partitioned and sorted.

If the Merge stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

Whether the Merge stage is set to execute in parallel or sequential mode.

Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Merge stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Merge stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Merge stage.

Entire. Each file written to receives the entire data set.

Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

Random. The records are partitioned randomly, based on the output of a random number generator.

Round Robin. The records are partitioned on a round robin basis as they enter the stage.

Same. Preserves the partitioning already in place.

DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

(Auto). This is the default collection method for the Merge stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. In the case of a Merge stage, Auto will also ensure that the collected data is sorted.

Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the merge is performed (you might use this if you have selected a partitioning method other than auto or same). The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default auto methods). Select the check boxes as follows:

Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

Stable. Select this if you want to preserve previously sorted data sets. This is the default.

Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
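To make the difference between the keyed partitioning methods concrete, here is a minimal Python sketch (not DataStage code; the column names and partition count are invented) showing how Hash and Modulus assign each record to a partition from a key value, so that records with the same key always land in the same partition.

from collections import defaultdict

# Invented sample rows; in a real job these would be the records on the link.
records = [
    {"accountNo": 7125678, "accountType": "plat"},
    {"accountNo": 7238892, "accountType": "flexi"},
    {"accountNo": 7611236, "accountType": "gold"},
    {"accountNo": 7176672, "accountType": "flexi"},
]
num_partitions = 4

def hash_partition(record, key):
    # Hash partitioning: hash the key value, modulo the number of partitions.
    return hash(record[key]) % num_partitions

def modulus_partition(record, key):
    # Modulus partitioning: an integer key value, modulo the number of partitions.
    return record[key] % num_partitions

partitions = defaultdict(list)
for rec in records:
    partitions[hash_partition(rec, "accountType")].append(rec)

# Rows with equal accountType values always end up in the same partition.
for part in sorted(partitions):
    print(part, [r["accountNo"] for r in partitions[part]])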

Outputs Page
The Outputs page allows you to specify details about data output from the Merge stage. The Merge stage can have only one master output link carrying the merged data and a number of reject links, each carrying rejected records from one of the update links. Choose an output link from the Output name drop down list to specify which link you want to work on. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship between the columns being input to the Merge stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output links. Details about Merge stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Reject Links
You cannot change the properties of a Reject link. They have the meta data of the corresponding incoming update link and this cannot be altered.

Mapping Tab
For Merge stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them.

The left pane shows the columns of the merged data. These are read only and cannot be modified on this tab. This shows the meta data from the master input link and any additional columns carried on the update links. The right pane shows the output columns for the master output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. In the above example the left pane represents the incoming data after the merge has been performed. The right pane represents the data being output by the stage after the merge operation. In this example the data has been mapped straight across.


21
Lookup Stage
The Lookup stage is a processing stage. It is used to perform lookup operations on a data set read into memory from any other Parallel job stage that can output data. It can also perform lookups directly in a DB2 or Oracle database (see Chapter 12 and Chapter 13) or in a lookup table contained in a Lookup File Set stage (see Chapter 7).

The most common use for a lookup is to map short codes in the input data set onto expanded information from a lookup table which is then joined to the incoming data and output. For example, you could have an input data set carrying names and addresses of your U.S. customers. The data as presented identifies state as a two letter U.S. state postal code, but you want the data to carry the full name of the state. You could define a lookup table that carries a list of codes matched to states, defining the code as the key column. As the Lookup stage reads each line, it uses the key to look up the state in the lookup table. It adds the state to a new column defined for the output link, and so the full state name is added to each address. If any state codes have been incorrectly entered in the data set, the code will not be found in the lookup table, and so that record will be rejected.

Lookups can also be used for validation of a row. If there is no corresponding entry in a lookup table to the key's values, the row is rejected.

The Lookup stage is one of three stages that join tables based on the values of key columns. The other two are:

Join stage (Chapter 19)

Merge stage (Chapter 20)

The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and their requirements for data being input (for example, whether it is sorted). See "Lookup Versus Join" on page 21-5 for help in deciding which stage to use.

The Lookup stage can have a reference link, a single input link, a single output link, and a single rejects link. Depending upon the type and setting of the stage(s) providing the look up information, it can have multiple reference links (where it is directly looking up a DB2 table or Oracle table, it can only have a single reference link). A lot of the setting up of a lookup operation takes place on the stage providing the lookup table. The input link carries the data from the source data set and is known as the primary link. The following pictures show some example jobs performing lookups.


For each record of the source data set from the primary link, the Lookup stage performs a table lookup on each of the lookup tables attached by reference links. The table lookup is based on the values of a set of lookup key columns, one set for each table. The keys are defined on the Lookup stage. For lookups of data accessed through the Lookup File Set stage, the keys are specified when you create the look up file set. You can specify a condition on each of the reference links, such that the stage will only perform a lookup on that reference link if the condition is satisfied.

Lookup stages do not require data on the input link or reference links to be sorted. Be aware, though, that large in-memory look up tables will degrade performance because of their paging requirements. Each record of the output data set contains columns from a source record plus columns from all the corresponding lookup records where corresponding source and lookup records have the same value for the lookup key columns. The lookup key columns do not have to have the same names in the primary and the reference links.

The optional reject link carries source records that do not have a corresponding entry in the input lookup tables.

There are some special partitioning considerations for lookup stages. You need to ensure that the data being looked up in the lookup table is in the same partition as the input data referencing it. One way of doing this is to partition the lookup tables using the Entire method. Another way is to partition it in the same way as the input data (although this implies sorting of the data).


Unlike most of the other stages in a Parallel job, the Lookup stage has its own user interface. It does not use the generic interface as described in Chapter 3. When you edit a Lookup stage, the Lookup Editor appears. An example Lookup stage is shown below. The left pane represents input data and lookup data, and the right pane represents output data. In this example, the Lookup stage has a primary link and single reference link, and a single output link. Meta data has been defined for all links.

Lookup Versus Join


DataStage doesn't know how large your data is, so cannot make an informed choice whether to combine data using a Join stage or a Lookup stage. Here's how to decide which to use:

There are two data sets being combined. One is the primary or driving dataset, sometimes called the left of the join. The other data set(s) are the reference datasets, or the right of the join. In all cases we are concerned with the size of the reference datasets. If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage may thrash because the reference datasets may not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation. So, if the reference datasets are big enough to cause trouble, use a join.

A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over the join processing is very fast and never involves paging or other I/O.
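The trade-off can be pictured with a small sketch in plain Python (an illustration of the two strategies, not DataStage code; the data and names are invented): an in-memory lookup keeps the whole reference data set in a dictionary, while a join sorts both inputs on the key and merges them in a single pass.

# Sketch of the two strategies on small invented data sets.
primary = [("plat", "Latimer"), ("gold", "Cranmer"), ("flexi", "Ridley")]
reference = [("plat", 2.00), ("gold", 1.75), ("flexi", 1.88)]

# Lookup strategy: the whole reference data set must fit in memory.
ref_table = dict(reference)
lookup_result = [(cust, key, ref_table[key]) for key, cust in primary]

# Join strategy: sort both inputs on the key, then merge with a single scan.
def sort_merge_join(left, right):
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            out.append((left[i][1], left[i][0], right[j][1]))
            i += 1
        elif left[i][0] < right[j][0]:
            i += 1
        else:
            j += 1
    return out

join_result = sort_merge_join(primary, reference)
print(lookup_result)
print(join_result)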

Example Look Up
This example shows what happens when data is looked up in a lookup table. The stage in this case will look up the interest rate for each customer based on the account type. Here is the data that arrives on the primary link:
Customer    accountNo    accountType    balance
Latimer     7125678      plat           7890.76
Ridley      7238892      flexi          234.88
Cranmer     7611236      gold           1288.00
Hooper      7176672      flexi          3456.99
Moore       7146789      gold           424.76


Here is the data in the lookup table:


accountType    InterestRate
bronze         1.25
silver         1.50
gold           1.75
plat           2.00
flexi          1.88
fixterm        3.00

Here is what the lookup stage will output:


Customer    accountNo    accountType    balance    InterestRate
Latimer     7125678      plat           7890.76    2.00
Ridley      7238892      flexi          234.88     1.88
Cranmer     7611236      gold           1288.00    1.75
Hooper      7176672      flexi          3456.99    1.88
Moore       7146789      gold           424.76     1.75

Here is a job that performs this simple lookup:


The accounts data set holds the details of customers and their account types, the interest rates are held in an Oracle table. The lookup stage is set as follows:

All the columns in the accounts data set are mapped over to the output link. The AccountType column in the accounts data set has been joined to the AccountType column of the interest_rates table. For each row, the AccountType is looked up in the interest_rates table and the corresponding interest rate is returned. The reference link has a condition on it. This detects if the balance is null in any of the rows of the accounts data set. If the balance is null the row is sent to the rejects link (the rejects link does not appear in the lookup editor because there is nothing you can change).
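The effect of this example can be sketched in a few lines of plain Python (not DataStage code; the null-balance condition, the reject handling, and the column names follow the example above, and the rows shown are just a sample):

# Sketch of the example: enrich each account row with an interest rate from
# the lookup table, rejecting rows whose balance is null (None here).
interest_rates = {"bronze": 1.25, "silver": 1.50, "gold": 1.75,
                  "plat": 2.00, "flexi": 1.88, "fixterm": 3.00}

accounts = [
    {"Customer": "Latimer", "accountNo": 7125678, "accountType": "plat",  "balance": 7890.76},
    {"Customer": "Ridley",  "accountNo": 7238892, "accountType": "flexi", "balance": None},
]

output, rejects = [], []
for row in accounts:
    if row["balance"] is None:          # condition on the reference link not met
        rejects.append(row)
        continue
    rate = interest_rates.get(row["accountType"])
    if rate is None:                    # lookup failure: no matching key
        rejects.append(row)
        continue
    output.append({**row, "InterestRate": rate})

print(output)
print(rejects)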

Must Dos
DataStage has many defaults which means that lookups can be simple to set up. This section specifies the minimum steps to take to get a Lookup stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are when you get familiar with the product. The exact steps you need to take when setting up a Lookup stage depend on what type of lookup table you are looking up.


Using In-Memory Lookup Tables


If you are accessing a lookup table read into memory from some other stage, you need to do the following:

In the Data Input source stage:

Specify details about the data source (for example, if using a File Set stage, give the name of the File Set).

Ensure required column meta data has been specified.

Fulfil any must dos for that particular stage editor.

In the stage providing the lookup table:

Ensure required column meta data has been specified.

Fulfil any must dos for that particular stage editor.

In the Lookup stage:

Map the required columns from your data input link to the output link (you can drag them or copy and paste them).

Map the required columns from your lookup table or tables to the output link (again you can drag them or copy and paste them).

Specify the key column or columns which are used for the lookup. Do this by dragging or copying and pasting key columns from the data link to the Key Expression field in the lookup table link. Note that key expressions can only be specified on key fields (i.e. columns that have the key field selected in the column definitions). If you drag a column that is not currently defined as a key, you are asked if you want to make it one. If you want the comparison performed on this column to ignore case, then select the Caseless checkbox.

If you want to impose conditions on your lookup, or want to use a reject link, you need to double click on the Condition header of a reference link, choose Conditions from the link shortcut menu, or click the Condition toolbar button. The Lookup Stage Conditions dialog box appears. This allows you to:

Specify that one of the reference links is allowed to return multiple rows when performing a lookup without causing an error (choose the relevant reference link from the Multiple rows returned from link drop-down list).

Specify a condition for the required references. Double click the Condition box (or press CTRL-E) to open the expression editor. This expression can access all the columns of the primary link, plus columns in reference links that are processed before this link.

Specify what happens if the condition is not met on each link.


Specify what happens if a lookup fails on each link.

Next you need to open the Stage Properties dialog box for the Lookup stage. Do this by choosing the Stage Properties icon from the stage editor toolbar, or by choosing Stage Properties or Link Properties from the stage editor shortcut menu (choosing Link Properties will open the dialog with the link you are looking at selected, otherwise you may need to choose the correct link from the Input name or Output name drop-down list).

In the Stage Page Link Ordering Tab, check that your links are correctly identified as primary and lookup(s), and reorder if required (the links will be shown in the new order on the Lookup canvas).

Unless you have particular partitioning requirements, leave the default auto setting on the Inputs Page Partitioning Tab.

Using Oracle or DB2 Databases Directly


If you are doing a direct look up in an Oracle or DB2 database table (known as sparse mode), you need to do the following:

In the Data Input source stage:

Specify details about the data source (for example, if using a File Set stage, give the name of the File Set).

Ensure required column meta data has been specified (this may be done in another stage).

Fulfil any must dos for that particular stage editor.

In the Oracle or DB2/UDB Enterprise Stage:

Set the Lookup Type to sparse. If you don't do this the lookup will operate as an in-memory lookup.

Specify required details for connecting to the database table.

Ensure required column meta data has been specified (this may be omitted altogether if you are relying on Runtime Column Propagation).

See Chapter 12 for details about the DB2/UDB Enterprise Stage and Chapter 13 for details about the Oracle Enterprise Stage.

In the Lookup stage:

Map the required columns from your data input link to the output link (you can drag them or copy and paste them).

Map the required columns from your lookup table or tables to the output link (again you can drag them or copy and paste them).


If you want to impose conditions on your lookup, or want to use a reject link, you need to double click on the Condition header, choose Conditions from the link shortcut menu, or click the Condition toolbar icon. The Lookup Stage Conditions dialog box appears. This allows you to:

Specify what happens if a lookup fails on this link.

Next you need to open the Stage Properties dialog box for the Lookup stage. Do this by choosing the Stage Properties icon from the stage editor toolbar, or by choosing Stage Properties or Link Properties from the stage editor shortcut menu (choosing Link Properties will open the dialog with the link you are looking at selected, otherwise you may need to choose the correct link from the Input name or Output name drop-down list).

In the Stage Page Link Ordering Tab, check that your links are correctly identified as primary and lookup(s), and reorder if required.

Unless you have particular partitioning requirements, leave the default auto setting on the Inputs Page Partitioning Tab.

Using Lookup Fileset


If you are accessing a lookup table held in a lookup fileset that you have previously created using DataStage, you need to do the following:

In the Data Input source stage:

Specify details about the data source (for example, if using a File Set stage, give the name of the File Set).

Ensure required column meta data has been specified.

Fulfil any must dos for that particular stage editor.

In the Lookup File stage:

Specify the name of the file set holding the lookup table.

Make sure that the key column or columns were specified when the file set holding the lookup table was created.

Ensure required column meta data has been specified.

See Chapter 7 for details about the Lookup File stage.

In the Lookup stage:

Map the required columns from your data input link to the output link (you can drag them or copy and paste them).


Map the required columns from your lookup table or tables to the output link (again you can drag them or copy and paste them). As you are using a lookup file set this is all the mapping you need to do; the key column or columns for the lookup is defined when you create the lookup file set.

Next you need to open the Stage Properties dialog box for the Lookup stage. Do this by choosing the Stage Properties icon from the stage editor toolbar, or by choosing Stage Properties or Link Properties from the stage editor shortcut menu (choosing Link Properties will open the dialog with the link you are looking at selected, otherwise you may need to choose the correct link from the Input name or Output name drop-down list).

In the Stage Page Link Ordering Tab, check that your links are correctly identified as primary and lookup(s), and reorder if required.

Unless you have particular partitioning requirements, leave the default auto setting on the Inputs Page Partitioning Tab.

Lookup Editor Components


The Lookup Editor has the following components.

Toolbar
The Lookup toolbar contains the following buttons:
stage properties, conditions, show all or selected relations, column auto-match, find/replace, input link execution order, output link execution order, cut, copy, paste, save column definition, and load column definition.

Link Area
The top area displays links to and from the Lookup stage, showing their columns and the relationships between them. The link area is divided into two panes; you can drag the splitter bar between them to resize the panes relative to one another. There is also a horizontal scroll bar, allowing you to scroll the view left or right.


The left pane shows the input link, the right pane shows output links. Output columns that have an invalid derivation defined are shown in red. Reference link input key columns with invalid key expressions are also shown in red. Within the Lookup Editor, a single link may be selected at any one time. When selected, the link's title bar is highlighted, and arrowheads indicate any selected columns within that link.

Meta Data Area


The bottom area shows the column meta data for input and output links. Again this area is divided into two panes: the left showing input link meta data and the right showing output link meta data. The meta data for each link is shown in a grid contained within a tabbed page. Click the tab to bring the required link to the front. That link is also selected in the link area. If you select a link in the link area, its meta data tab is brought to the front automatically. You can edit the grids to change the column meta data on any of the links. You can also add and delete meta data. As with column meta data grids on other stage editors, edit row in the context menu allows editing of the full meta data definitions (see "Columns Tab" on page 3-26).

Shortcut Menus
The Lookup Editor shortcut menus are displayed by right-clicking the links in the links area. There are slightly different menus, depending on whether you right-click an input link, or an output link. The input link menu offers you operations on input columns, the output link menu offers you operations on output columns and their derivations.

The shortcut menu enables you to:

Open the Stage Properties dialog box in order to specify stage or link properties.

Open the Lookup Stage Conditions dialog box to specify a conditional lookup.

Open the Column Auto Match dialog box.

Display the Find/Replace dialog box.

Display the Select dialog box.


Validate, or clear a derivation.

Append a new column to the selected link.

Select all columns on a link.

Insert or delete columns.

Cut, copy, and paste a column or a key expression or a derivation.

If you display the menu from the links area background, you can:

Open the Stage Properties dialog box in order to specify stage or link properties.

Open the Lookup Stage Conditions dialog box to specify a conditional lookup.

Open the Link Execution Order dialog box in order to specify the order in which links should be processed.

Toggle between viewing link relations for all links, or for the selected link only.

Right-clicking in the meta data area of the Lookup Editor opens the standard grid editing shortcut menus.

Editing Lookup Stages


The Lookup Editor enables you to perform the following operations on a Lookup stage:

Create new columns on a link

Delete columns from within a link

Move columns within a link

Edit column meta data

Specify key expressions

Map input columns to output columns

Using Drag and Drop


Many of the Lookup stage edits can be made simpler by using the Lookup Editor's drag and drop functionality. You can drag columns from any link to any other link. Common uses are:

Copying input columns to output links

Moving columns within a link

Setting derivation or key expressions


To use drag and drop:

1 Click the source cell to select it.

2 Click the selected cell again and, without releasing the mouse button, drag the mouse pointer to the desired location within the target link. An insert point appears on the target link to indicate where the new cell will go. This can be to create a new column, or set a derivation. The exact action depends on where you drop.

3 Release the mouse button to drop the selected cell.

You can drag and drop multiple columns, key expressions, or derivations. Use the standard Explorer keys when selecting the source column cells, then proceed as for a single cell. You can drag and drop the full column set by dragging the link title. The drag and drop insert point is shown below:

Find and Replace Facilities


If you are working on a complex job where several links, each containing several columns, go in and out of the Lookup stage, you can use the find/replace column facility to help locate a particular column or expression and change it.

The find/replace facility enables you to:

Find and replace a column name

Find and replace expression text

Find the next empty expression

Find the next expression that contains an error

To use the find/replace facilities, do one of the following:

Click the find/replace button on the toolbar

Choose find/replace from the link shortcut menu

Type Ctrl-F

The Find and Replace dialog box appears. It has three tabs:


Expression Text. Allows you to locate the occurrence of a particular string within an expression, and replace it if required. You can search up or down, and choose to match case, match whole words, or neither. You can also choose to replace all occurrences of the string within an expression.

Column Names. Allows you to find a particular column and rename it if required. You can search up or down, and choose to match case, match the whole word, or neither.

Expression Types. Allows you to find the next empty expression or the next expression that contains an error. You can also press Ctrl-M to find the next empty expression or Ctrl-N to find the next erroneous expression.

Note The find and replace results are shown in the color specified in Tools > Options.

Press F3 to repeat the last search you made without opening the Find and Replace dialog box.

Select Facilities
If you are working on a complex job where several links, each containing several columns, go in and out of the Lookup stage, you can use the select column facility to select multiple columns.

The select facility enables you to:

Select all columns whose expressions contain text that matches the text specified.

Select all columns whose name contains the text specified (and, optionally, matches a specified type).

Select all columns with a certain data type.

Select all columns with missing or invalid expressions.

To use the select facilities, choose Select from the link shortcut menu. The Select dialog box appears. It has three tabs:

Expression Text. The Expression Text tab allows you to select all columns/stage variables whose expressions contain text that matches the text specified. The text specified is a simple text match, taking into account the Match case setting.

Column Names. The Column Names tab allows you to select all column/stage variables whose Name contains the text specified. There is an additional Data Type drop down list, that will limit the columns selected to those with that data type. You can use the Data Type drop down list on its own to select all columns of a certain data type. For example, all string columns can be selected by leaving the text field blank, and selecting String as the data


type. The data types in the list are generic data types, where each of the column SQL data types belong to one of these generic types.

Expression Types. The Expression Types tab allows you to select all columns with either empty expressions or invalid expressions.

Creating and Deleting Columns


You can create columns on links to the Lookup stage using any of the following methods:

Select the link, then click the load column definition button in the toolbar to open the standard load columns dialog box.

Use drag and drop or copy and paste functionality to create a new column by copying from an existing column on another link.

Use the shortcut menus to create a new column definition.

Edit the grids in the link's meta data tab to insert a new column.

When copying columns, a new column is created with the same meta data as the column it was copied from.

To delete a column from within the Lookup Editor, select the column you want to delete and click the cut button or choose Delete Column from the shortcut menu.

Moving Columns Within a Link


You can move columns within a link using either drag and drop or cut and paste. Select the required column, then drag it to its new location, or cut it and paste it in its new location.

Editing Column Meta Data


You can edit column meta data from within the grid in the bottom of the Lookup Editor. Select the tab for the link meta data that you want to edit, then use the standard DataStage edit grid controls. The meta data shown does not include column derivations since these are edited in the links area.

Defining Output Column Derivations


You can define the derivation of output columns from within the Lookup Editor in a number of ways:


To map an input column (from data input or reference input) onto an output column you can use drag and drop or copy and paste to copy an input column to an output link. The output columns will have the same names as the input columns from which they were derived.

If the output column already exists, you can drag or copy an input column to the output column's Derivation field. This specifies that the column is directly derived from an input column, with no transformations performed.

You can use the column auto-match facility to automatically set that output columns are derived from their matching input columns.

If a derivation is displayed in red (or the color defined in Tools > Options), it means that the Lookup Editor considers it incorrect. To see why it is invalid, choose Validate Derivation from the shortcut menu.

Once an output link column has a derivation defined that contains any input link columns, then a relationship line is drawn between the input column and the output column, as shown in the following example. This is a simple example; there can be multiple relationship lines either in or out of columns. You can choose whether to view the relationships for all links, or just the relationships for the selected links, using the button in the toolbar.

Column Auto-Match Facility


This time-saving feature allows you to automatically set columns on an output link to be derived from matching columns on an input link. Using this feature you can fill in all the output link derivations to route data from corresponding input columns, then go back and edit individual output link columns where you want a different derivation. To use this facility:
1 Do one of the following:

Click the Auto-match button in the Lookup Editor toolbar.

Choose Auto-match from the input link header or output link header shortcut menu.


The Column Auto-Match dialog box appears:

2 Choose the output link that you want to match columns with the input link from the drop down list.

3 Click Location match or Name match from the Match type area. If you choose Location match, this will set output column derivations to the input link columns in the equivalent positions. It starts with the first input link column going to the first output link column, and works its way down until there are no more input columns left. If you choose Name match, you need to specify further information for the input and output columns as follows:

Input columns:

Match all columns or Match selected columns. Choose one of these to specify whether all input link columns should be matched, or only those currently selected on the input link.

Ignore prefix. Allows you to optionally specify characters at the front of the column name that should be ignored during the matching procedure.

Ignore suffix. Allows you to optionally specify characters at the end of the column name that should be ignored during the matching procedure.

Output columns:

Ignore prefix. Allows you to optionally specify characters at the front of the column name that should be ignored during the matching procedure.


Ignore suffix. Allows you to optionally specify characters at the end of the column name that should be ignored during the matching procedure.

Ignore case. Select this check box to specify that case should be ignored when matching names. The setting of this also affects the Ignore prefix and Ignore suffix settings. For example, if you specify that the prefix IP will be ignored, and turn Ignore case on, then both IP and ip will be ignored.

Click OK to proceed with the auto-matching.

Note Auto-matching does not take into account any data type incompatibility between matched columns; the derivations are set regardless.
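As an illustration only (plain Python, not part of DataStage), the following sketch shows how Name match can pair output columns with input columns while ignoring a prefix and case; the column names and the "IP" prefix are invented, echoing the example above.

# Sketch of Name match with Ignore prefix and Ignore case.
def normalize(name, prefix="", suffix="", ignore_case=True):
    if ignore_case:
        name, prefix, suffix = name.lower(), prefix.lower(), suffix.lower()
    if prefix and name.startswith(prefix):
        name = name[len(prefix):]
    if suffix and name.endswith(suffix):
        name = name[: len(name) - len(suffix)]
    return name

input_cols = ["IPCustomer", "IPAccountNo", "Balance"]
output_cols = ["customer", "accountno", "balance"]

# Derive each output column from the input column whose normalized name matches.
matches = {
    out: next((inp for inp in input_cols
               if normalize(inp, prefix="IP") == normalize(out)), None)
    for out in output_cols
}
print(matches)  # {'customer': 'IPCustomer', 'accountno': 'IPAccountNo', 'balance': 'Balance'}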

Defining Input Column Key Expressions


You can define key expressions for key fields of reference inputs. This is similar to defining derivations for output columns. The key expression is an equijoin from a primary input link column. You can specify it in two ways:

Use drag and drop to drag a primary input link column to the appropriate key expression cell.

Use copy and paste to copy a primary input link column and paste it on the appropriate key expression cell.

A relationship link is drawn between the primary input link column and the key expression. You can also use drag and drop or copy and paste to copy an existing key expression to another input column, and you can drag or copy multiple selections.

If a key expression is displayed in red (or the color defined in Tools > Options), it means that the Lookup Editor considers it incorrect. To see why it is invalid, choose Validate Derivation from the shortcut menu.

Lookup Stage Properties


The Lookup stage has a Properties dialog box which allows you to specify details about how the stage operates. The Lookup Stage Properties dialog box has three pages:

Stage Page. This is used to specify general information about the stage.

Inputs Page. This is where you specify details about the data input to the Lookup stage.

Outputs Page. This is where you specify details about the output links from the Lookup stage.

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which order the input links are processed in. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules. The Build tab allows you to override the default compiler and linker flags for this particular stage.

Advanced Tab
This tab allows you to specify the following:


Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.

Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage on the stream link. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering Tab


This tab allows you to specify which input link is the primary link and the order in which the reference links are processed.

By default the input links will be processed in the order they were added. To rearrange them, choose an input link and click the up arrow button or the down arrow button. You can also access this tab by clicking the input link order button in the toolbar, or by choosing Reorder input links from the shortcut menu.

NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Lookup stage uses this when it is determining the order of the key fields. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Build Tab
In some cases the Lookup stage may use C++ code to implement your lookup. In this case, you can use the Build tab to override the compiler and linker flags that have been set for the job or project. The flags you specify here will take effect for this stage and this stage alone. The flags available are platform and compiler-dependent.


Inputs Page
The Inputs page allows you to specify details about the incoming data set and the reference links. Choose a link from the Input name drop down list to specify which link you want to work on.

The General tab allows you to specify an optional description of the link. When you are performing an in-memory lookup, the General tab has two additional fields:

Save to lookup fileset. Allows you to specify a lookup file set to save the look up data.

Diskpool. Specify the name of the disk pool into which to write the file set. You can also specify a job parameter.

The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Advanced tab allows you to change the default buffering settings for the input link.

Details about Lookup stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the lookup is performed. It also allows you to specify that the data should be sorted before the lookup. Note that you cannot specify partitioning or sorting on the reference links; this is specified in their source stage.

By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job the stage will warn you when the job runs if it cannot preserve the partitioning of the incoming data.

If the Lookup stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

Whether the Lookup stage is set to execute in parallel or sequential mode.

Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Lookup stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. You may need to ensure that your lookup tables have been partitioned using the Entire method, so that the lookup tables will always contain the full set of data that might need to be looked up. For lookup files and lookup tables being looked up in databases, the partitioning is performed on those stages.

If the Lookup stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Lookup stage.

Entire. Each file written to receives the entire data set.

Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

Random. The records are partitioned randomly, based on the output of a random number generator.

Round Robin. The records are partitioned on a round robin basis as they enter the stage.

Same. Preserves the partitioning already in place.

DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

(Auto). This is the default collection method for the Lookup stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.

Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the lookup is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows:

Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

Stable. Select this if you want to preserve previously sorted data sets. This is the default.

Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
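The reason Entire partitioning suits reference data can be pictured with a plain Python sketch (an illustration only, with invented data): every node gets a complete copy of the lookup table, while the primary rows are split across nodes, so each row can be resolved locally without repartitioning.

# Sketch only: Entire partitioning of reference data versus split primary data.
primary_rows = [("plat", 1), ("gold", 2), ("flexi", 3), ("gold", 4)]
lookup_table = {"plat": 2.00, "gold": 1.75, "flexi": 1.88}
num_nodes = 2

# Primary data is divided between nodes (round robin here for simplicity).
node_inputs = {n: primary_rows[n::num_nodes] for n in range(num_nodes)}

# Entire: every node receives a full copy of the lookup table.
node_lookup = {n: dict(lookup_table) for n in range(num_nodes)}

for n in range(num_nodes):
    enriched = [(key, rowid, node_lookup[n][key]) for key, rowid in node_inputs[n]]
    print(f"node {n}:", enriched)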

Outputs Page
The Outputs page allows you to specify details about data output from the Lookup stage. The Lookup stage can have only one output link. It can also have a single reject link, where records can be sent if the lookup fails. The Output Link drop-down list allows you to choose whether you are looking at details of the main output link (the stream link) or the reject link. The General tab allows you to specify an optional description of the output link. The Advanced tab allows you to change the default buffering settings for the output links.


Reject Links
You cannot set the mapping or edit the column definitions for a reject link. The link uses the column definitions for the primary input link.

Lookup Stage Conditions


The Lookup stage has a Lookup Stage Conditions dialog box that allows you to specify:

Which reference link (if any) can return multiple rows from a lookup.

A condition that should be fulfilled before a lookup is performed on a reference link.

What action should be taken if a condition on a reference link is not met.

What action should be taken if a lookup on a link fails.

You can open the Lookup Stage Conditions dialog box by:

Double-clicking on the Condition: bar on a reference link.

Selecting Conditions from the background shortcut menu.

Clicking the Conditions toolbar button.

Selecting Conditions from the link shortcut menu.

To specify that a link can legitimately return multiple rows:

Select the link name from the Multiple rows returned from link drop-down list (note that only one reference link in a Lookup stage is allowed to return multiple rows, and that this feature is only available for in-memory lookups).

To specify a condition for a reference link:


Double click on the Condition field for the link you want to specify a condition for. The field expands to let you type in a condition, or click the browse button to open the expression editor to get help in specifying an expression. The condition should return a TRUE/FALSE result (for example DSLINK1.COL1 > 0).

To specify the action taken if the specified condition is not met: Choose an action from the Condition Not Met drop-down list. Possible actions are:

Continue. The fields from that link are set to NULL if the field is nullable, or to a default value if not. Continues processing any further lookups before sending the row to the output link.

Drop. Drops the row and continues with the next lookup.

Fail. Causes the job to issue a fatal error and stop.

Reject. Sends the row to the reject link.

To specify the action taken if a lookup on a link fails: Choose an action from the Lookup Failure drop-down list. Possible actions are:

Continue. The fields from that link are set to NULL if the field is nullable, or to a default value if not. Continues processing any further lookups before sending the row to the output link.

Drop. Drops the row and continues with the next lookup.

Fail. Causes the job to issue a fatal error and stop.

Reject. Sends the row to the reject link.
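A minimal Python sketch (illustrative only; the data, function, and column names are invented) of how the four actions differ for a single reference link:

# Sketch of the four Lookup Failure actions for one reference link.
lookup_table = {"gold": 1.75, "plat": 2.00}

def apply_failure_action(row, key, action):
    if key in lookup_table:
        return "output", {**row, "InterestRate": lookup_table[key]}
    if action == "Continue":           # keep the row, lookup columns set to NULL
        return "output", {**row, "InterestRate": None}
    if action == "Drop":               # silently discard the row
        return "dropped", None
    if action == "Fail":               # abort the whole job
        raise RuntimeError("lookup failed on key %r" % key)
    if action == "Reject":             # send the row to the reject link
        return "reject", row
    raise ValueError("unknown action")

print(apply_failure_action({"accountNo": 1}, "bronze", "Continue"))
print(apply_failure_action({"accountNo": 2}, "bronze", "Reject"))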


The DataStage Expression Editor


The DataStage Expression Editor helps you to enter correct expressions when you edit Lookup stages. The Expression Editor can:

Facilitate the entry of expression elements

Complete the names of frequently used variables

Validate the expression

The Expression Editor can be opened from:

Lookup Stage Conditions dialog box

Expression Format
The format of an expression is as follows:
KEY:
something_like_this is a token
something_in_italics is a terminal, i.e., doesn't break down any further
| is a choice between tokens
[ ] is an optional part of the construction
"XXX" is a literal token (i.e., use XXX not including the quotes)

=================================================

expression ::= function_call |
               variable_name |
               other_name |
               constant |
               unary_expression |
               binary_expression |
               if_then_else_expression |
               substring_expression |
               "(" expression ")"

function_call ::= function_name "(" [argument_list] ")"
argument_list ::= expression | expression "," argument_list
function_name ::= name of a built-in function |
                  name of a user-defined_function
variable_name ::= job_parameter name
other_name ::= name of a built-in macro, system variable, etc.
constant ::= numeric_constant | string_constant
numeric_constant ::= ["+" | "-"] digits ["." [digits]] ["E" | "e" ["+" | "-"] digits]
string_constant ::= "'" [characters] "'" |
                    """ [characters] """ |
                    "\" [characters] "\"
unary_expression ::= unary_operator expression
unary_operator ::= "+" | "-"
binary_expression ::= expression binary_operator expression
binary_operator ::= arithmetic_operator |
                    concatenation_operator |
                    matches_operator |
                    relational_operator |
                    logical_operator
arithmetic_operator ::= "+" | "-" | "*" | "/" | "^"
concatenation_operator ::= ":"
relational_operator ::= "=" | "EQ" | "<>" | "#" | "NE" | ">" | "GT" | ">=" | "=>" | "GE" |
                        "<" | "LT" | "<=" | "=<" | "LE"
logical_operator ::= "AND" | "OR"
if_then_else_expression ::= "IF" expression "THEN" expression "ELSE" expression
substring_expression ::= expression "[" [expression ["," expression]] "]"
field_expression ::= expression "[" expression "," expression "," expression "]"
/* That is, always 3 args */

Note: keywords like "AND" or "IF" or "EQ" may be in any case

Entering Expressions
Whenever the insertion point is in an expression box, you can use the Expression Editor to suggest the next element in your expression. Do this by right-clicking the box, or by clicking the Suggest button to the right of the box. This opens the Suggest Operand or Suggest Operator menu. Which menu appears depends on context, i.e., whether you should be entering an operand or an operator as the next expression element. The Functions available from this menu are described in Appendix B. The DS macros are described in "Job Status Macros" in Parallel Job Advanced Developers Guide. You can also specify custom routines for use in the expression editor (see "Working with Parallel Routines" in DataStage Manager Guide). Suggest Operand Menu:


Suggest Operator Menu:

Completing Variable Names


The Expression Editor stores variable names. When you enter a variable name you have used before, you can type the first few characters, then press F5. The Expression Editor completes the variable name for you. If you enter the name of the input link followed by a period, for example, DailySales., the Expression Editor displays a list of the column names of the link. If you continue typing, the list selection changes to match what you type. You can also select a column name using the mouse. Enter a selected column name into the expression by pressing Tab or Enter. Press Esc to dismiss the list without selecting a column name.

Validating the Expression


When you have entered an expression in the Lookup Editor, press Enter to validate it. The Expression Editor checks that the syntax is correct and that any variable names used are acceptable to the compiler. If there is an error, a message appears and the element causing the error is highlighted in the expression box. You can either correct the expression or close the Lookup Editor or Lookup dialog box. For any expression, selecting Validate from its shortcut menu will also validate it and show any errors in a message box.

Exiting the Expression Editor


You can exit the Expression Editor in the following ways: Press Esc (which discards changes). Press Return (which accepts changes). Click outside the Expression Editor box (which accepts changes).


Configuring the Expression Editor


You can resize the Expression Editor window by dragging. The next time you open the expression editor in the same context (for example, editing output columns) on the same client, it will have the same size. The Expression Editor is configured by editing the Designer options. This allows you to specify how helpful the expression editor is. For more information, see "Specifying Designer Options" in DataStage Designer Guide.


22
Funnel Stage
The Funnel stage is a processing stage. It copies multiple input data sets to a single output data set. This operation is useful for combining separate data sets into a single large data set. The stage can have any number of input links and a single output link.

The Funnel stage can operate in one of three modes:

Continuous Funnel combines the records of the input data in no guaranteed order. It takes one record from each input link in turn. If data is not available on an input link, the stage skips to the next link rather than waiting.

Sort Funnel combines the input records in the order defined by the value(s) of one or more key columns and the order of the output records is determined by these sorting keys.

Sequence copies all records from the first input data set to the output data set, then all the records from the second input data set, and so on.
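As a rough illustration (plain Python, not DataStage code; the rows are invented), the three modes differ only in how the input data sets are interleaved:

# Sketch of the three Funnel modes on two small invented input data sets.
import itertools

inputs = [
    [{"name": "Avery", "year": 1616}, {"name": "Bartholomew", "year": 1616}],
    [{"name": "Clement", "year": 1617}, {"name": "Dorothy", "year": 1617}],
]

# Continuous funnel: take one available row from each link in turn.
continuous = [row for group in itertools.zip_longest(*inputs)
              for row in group if row is not None]

# Sort funnel: output order is defined by the key column(s).
sort_funnel = sorted(itertools.chain(*inputs), key=lambda r: r["name"])

# Sequence: all rows from the first input, then all from the second, and so on.
sequence = list(itertools.chain(*inputs))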


For all methods the meta data of all input data sets must be identical. The sort funnel method has some particular requirements about its input data. All input data sets must be sorted by the same key columns as to be used by the Funnel operation. Typically all input data sets for a sort funnel operation are hash-partitioned before they're sorted (choosing the auto partitioning method will ensure that this is done). Hash partitioning guarantees that all records with the same key column values are located in the same partition and so are processed on the same node. If sorting and partitioning are carried out on separate stages before the Funnel stage, this partitioning must be preserved. The sort funnel operation allows you to set one primary key and multiple secondary keys. The Funnel stage first examines the primary key in each input record. For multiple records with the same primary key value, it then examines secondary keys to determine the order of records it will output.

The stage editor has three pages:

Stage Page. This is always present and is used to specify general information about the stage.

Inputs Page. This is where you specify details about the data sets being joined.

Outputs Page. This is where you specify details about the joined data being output from the stage.

Examples
Continuous Funnel Example
Our example data comprises seven separate data sets. Each data set contains a list of the residents of Woodstock for different years: 1616, 1617, 1619, 1622, 1627, 1662, and 1687. The following is a sample of the 1627 data set:

The Funnel stage, when set to continuous funnel, will combine these into a single data set. The job to perform the funnel is as follows:


The continuous funnel method is selected on the Stage page Properties tab of the Funnel stage:

The continuous funnel method does not attempt to impose any order on the data it is processing. It simply writes rows as they become available on the input links. In our example the stage has written a row from each input link in turn. A sample of the final, funneled, data is as follows:

Sort Funnel Example


In this example we are going to use the Funnel stage to sort the Woodstock data by inhabitants' names as it combines the data into a single data set. The data and the basic job are the same as for the Continuous Funnel example, but now we set the Funnel stage properties as follows:

The following is a sample of the output data set:

Note If you are running your sort funnel stage in parallel, you should be aware of the various considerations about sorting data and partitions. These are described in Chapter 23, "Sort Stage."


Sequence Funnel Example


In this example we funnel the Woodstock data on input one data set at a time. We end up with a data set that contains all the 1616 inhabitants, then all the 1617 ones, then all the 1619 ones and so on. Again the basic job and the source data are the same as for the continuous funnel example. The Funnel stage properties are set as follows:

When using the sequence method, you need to specify the order in which the Funnel stage processes its input links, as this affects the order of the sequencing. This is done on the Stage page Link Ordering tab:


The following is a sample of the output data set:

If you run the sequence funnel stage in parallel, you need to be mindful of the effects of data partitioning. If, for example, you ran our example job on a four-node system, you would get four partitions, each containing a section of the 1616 data, a section of the 1617 data, a section of the 1619 data, and so on.

Must Dos
DataStage has many defaults which means that it can be very easy to include Funnel stages in a job. This section specifies the minimum steps to take to get a Funnel stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Funnel stage: In the Stage Page Properties Tab, specify the Funnel Type. Continuous Funnel is the default, but you can also choose Sequence or Sort Funnel. If you choose to use the Sort Funnel method, you also need to specify the key on which the data will be sorted. You can repeat the key property to specify a composite key. If you are using the Sequence method, in the Stage Page Link Ordering Tab specify the order in which your data sets will be combined.


In the Output Page Mapping Tab, specify how the output columns are derived.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which order the input links are processed in. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/Funnel Type | Continuous Funnel/Sequence/Sort funnel | Continuous Funnel | Y | N | N/A
Sorting Keys/Key | Input Column | N/A | Y (if Funnel Type = Sort Funnel) | Y | N/A
Sorting Keys/Sort Order | Ascending/Descending | Ascending | Y (if Funnel Type = Sort Funnel) | N | Key
Sorting Keys/Nulls position | First/Last | First | Y (if Funnel Type = Sort Funnel) | N | Key
Sorting Keys/Case Sensitive | True/False | True | N | N | Key
Sorting Keys/Sort as EBCDIC | True/False | False | N | N | Key


Options Category
Funnel Type Specifies the type of Funnel operation. Choose from: Continuous Funnel Sequence Sort Funnel The default is Continuous Funnel.

Sorting Keys Category


Key This property is only required for Sort Funnel operations. Specify the key column that the sort will be carried out on. The first column you specify is the primary key, you can add multiple secondary keys by repeating the key property. You can use the Column Selection dialog box to select several keys at once if required (see page 3-10). Key has the following dependent properties: Sort Order Choose Ascending or Descending. The default is Ascending. Nulls position By default columns containing null values appear first in the funneled data set. To override this default so that columns containing null values appear last in the funneled data set, select Last. Sort as EBCDIC To sort as in the EBCDIC character set, choose True. Case Sensitive Use this to specify whether each key is case sensitive or not, this is set to True by default, i.e., the values CASE and case would not be judged equivalent.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts the setting which results from ORing the settings of the input stages, i.e., if any of the input stages uses Set then this stage will use Set. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempts to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering Tab


This tab allows you to specify the order in which links input to the Funnel stage are processed. This is only relevant if you have chosen the Sequence Funnel Type.

By default the input links will be processed in the order they were added. To rearrange them, choose an input link and click the up arrow button or the down arrow button.

NLS Locale Tab


This appears if you have NLS enabled on your system. If you are using the Sort Funnel funnel type, it lets you view the current default collate convention, and select a different one for this stage if required (for other funnel types, it is blank). You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Funnel stage uses this when it is determining the sort order for sort funnel. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. Choose an input link from the Input name drop down list to specify which link you want to work on. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being funneled. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Funnel stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning on Input Links


The Partitioning tab allows you to specify details about how the data on each of the incoming links is partitioned or collected before it is funneled. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of

current and preceding stages and how many nodes are specified in the Configuration file. If the Funnel stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Funnel stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Funnel stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If you are using the Sort Funnel method, and havent partitioned the data in a previous stage, you should key partition it by choosing the Hash or modulus partition method on this tab. If the Funnel stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Funnel stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button .

Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Funnel stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being funneled. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. If you are using the Sort Funnel method, and havent sorted the data in a previous stage, you should sort it here using the same keys that the data is hash partitioned on and funneled on. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you

can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Funnel stage. The Funnel stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Funnel stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about Funnel stage mapping is given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For Funnel stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns. These are read only and cannot be modified on this tab. It is a requirement of the Funnel stage that all input links have identical meta data, so only one set of column definitions is shown.


The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Automatch facility. In the above example the left pane represents the incoming data after it has been funneled. The right pane represents the data being output by the stage after the funnel operation. In this example the data has been mapped straight across.


23
Sort Stage
The Sort stage is a processing stage. It is used to perform more complex sort operations than can be provided for on the Input page Partitioning tab of parallel job stage editors. You can also use it to insert a more explicit simple sort operation where you want to make your job easier to understand. The Sort stage has a single input link which carries the data to be sorted, and a single output link carrying the sorted data.

You specify sorting keys as the criteria on which to perform the sort. A key is a column on which to sort the data, for example, if you had a name column you might specify that as the sort key to produce an alphabetical list of names. The first column you specify as a key to the stage is the primary key, but you can specify additional secondary keys. If multiple rows have the same value for the primary key column, then DataStage uses the secondary columns to sort these rows.


You can sort in sequential mode to sort an entire data set or in parallel mode to sort data within partitions, as shown below:
    Running sequentially:
        input:          Tom Dick Harry Jack Ted Mary Bob Jane Monica Bill Dave Mike
        sorted output:  Bill Bob Dave Dick Harry Jack Jane Mary Mike Monica Ted Tom

    Running in parallel (each partition sorted independently):
        partition 1:    Tom Dick Harry Jack    ->  Dick Harry Jack Tom
        partition 2:    Ted Mary Bob Jane      ->  Bob Jane Mary Ted
        partition 3:    Monica Bill Dave Mike  ->  Bill Dave Mike Monica

The stage uses temporary disk space when performing a sort. It looks in the following locations, in the following order, for this temporary space:
1. Scratch disks in the disk pool sort (you can create these pools in the configuration file).
2. Scratch disks in the default disk pool (scratch disks are included here by default).
3. The directory specified by the TMPDIR environment variable.
4. The directory /tmp.
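A sketch of that lookup order, assuming the configuration file has already been parsed into a list of (path, pool-names) entries, might look like this (the parsing itself is not shown and the pool handling is simplified):

    import os

    def choose_scratch_dirs(scratch_disks):
        """scratch_disks: list of (path, set_of_pool_names) from the configuration file."""
        sort_pool = [path for path, pools in scratch_disks if "sort" in pools]
        if sort_pool:
            return sort_pool                   # 1. scratch disks in the "sort" pool
        default_pool = [path for path, pools in scratch_disks if not pools]
        if default_pool:
            return default_pool                # 2. scratch disks in the default pool
        if os.environ.get("TMPDIR"):
            return [os.environ["TMPDIR"]]      # 3. the TMPDIR directory
        return ["/tmp"]                        # 4. /tmp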

You may perform a sort for several reasons. For example, you may want to sort a data set by a zip code column, then by last name within the zip code. Once you have sorted the data set, you can filter the data set by comparing adjacent records and removing any duplicates. However, you must be careful when processing a sorted data set: many types of processing, such as repartitioning, can destroy the sort order of the data. For example, assume you sort a data set on a system with four processing nodes and store the results to a data set stage. The data set will therefore have four partitions. You then use that data set as input to a stage executing on a different number of

nodes, possibly due to node constraints. DataStage automatically repartitions a data set to spread out the data set to all nodes in the system, unless you tell it not to, possibly destroying the sort order of the data. You could avoid this by specifying the Same partitioning method. The stage does not perform any repartitioning as it reads the input data set; the original partitions are preserved. You must also be careful when using a stage operating sequentially to process a sorted data set. A sequential stage executes on a single processing node to perform its action. Sequential stages will collect the data where the data set has more than one partition, which may also destroy the sorting order of its input data set. You can overcome this if you specify the collection method as follows: If the data was range partitioned before being sorted, you should use the ordered collection method to preserve the sort order of the data set. Using this collection method causes all the records from the first partition of a data set to be read first, then all records from the second partition, and so on. If the data was hash partitioned before being sorted, you should use the sort merge collection method specifying the same collection keys as the data was partitioned on.
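As a rough illustration of why those two collection methods work (a sketch, not the collector implementations): each partition is already sorted on the key columns, so concatenating range partitions in order, or merging hash partitions on the same keys, yields a single sorted stream.

    from heapq import merge
    from itertools import chain

    def ordered_collect(partitions):
        # Range-partitioned data: partition 0 holds the lowest key range,
        # so simple concatenation preserves the total order.
        return list(chain(*partitions))

    def sort_merge_collect(partitions, key_cols):
        # Hash-partitioned data: merge on the same keys the data was sorted on.
        return list(merge(*partitions, key=lambda rec: tuple(rec[c] for c in key_cols)))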
Note If you write a sorted data set to an RDBMS there is no guarantee that it will be read back in the same order unless you specifically structure the SQL query to ensure this.

By default the stage will sort with the native DataStage sorter, but you can also specify that it uses the UNIX sort command. The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify details about the data sets being sorted. Outputs Page. This is where you specify details about the sorted data being output from the stage.

Examples
Sequential Sort
This job sorts the contents of a sequential file, and writes it to a data set. The data is a list of the known inhabitants of Woodstock, England

in the seventeenth century, roughly sorted by source document date. We are going to sort it by surname instead.

Here is a sample of the input data (as seen from the source Sequential File stage, csv_input):

The meta data for the file is as follows:


The Sequential File stage runs sequentially because it only has one source file to read. The Sort stage is set to run sequentially on the Stage page Advanced tab. The sort stage properties are used to specify the column Sname as the primary sort key and Fname as the secondary sort key:

When the job is run the data is sorted into a single partition. The Data Set stage, woodstock_sorted, is set to run sequentially to write the data to a single partition. Here is a sample of the sorted data (viewed from the Data Set stage):


Parallel Sort
This example uses the same job and the same data as the previous example, but this time we are going to run the Sort stage in parallel and create a number of partitions. In the Sort stage we specify parallel execution in the Stage page Advanced tab. In the Inputs page Partitioning tab we specify a partitioning type of Hash, and specify the column Sname as the hash key. Because the partitioning takes place on the input link, the data is partitioned before the Sort stage actually tries to sort it. We hash partition to ensure that instances of the same surnames end up in the same partition. The data is then sorted within those partitions. We run the job on a four-node system, so end up with a data set comprising four partitions.

The following is a sample of the data in partition 2 after partitioning, but before sorting:


And here is a sample of the data in partition 2 after it has been processed by the sort stage:

Our parallel sort example has left us with four partitions, each containing roughly a quarter of the Woodstock data, and each ordered by name. The following shows the first 24 names in each partition.

Partition 0    Partition 1    Partition 2    Partition 3

If we want to bring the data back together into a single partition, for example to write to another sequential file, we need to be mindful of how it is collected, or we will lose the benefit of the sort. If we use the sort/merge collection method, specifying the Sname column as the collection key, we will end up with a totally sorted data set.


Total Sort
You can also perform a total sort on a parallel data set, such that the data is ordered within each partition and the partitions themselves are ordered. A total sort requires that all similar and duplicate records are located in the same partition of the data set. Similarity is based on the key fields in a record. The partitions also need to be approximately the same size so that no one node becomes a processing bottleneck. In order to meet these two requirements, the input data is partitioned using the range partitioner. This guarantees that all records with the same key fields are in the same partition, and calculates the partition boundaries based on the key field to ensure fairly even distribution. In order to use the range partitioner you must first take a sample of your input data, sort it, then use it to build a range partition map as described in Chapter 55, "Write Range Map Stage." You then specify this map when setting up the range partitioner in the Inputs page Partitioning tab of your Sort stage.
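The idea behind the range map can be sketched as follows (illustrative only; it is not the Write Range Map stage itself): take a sorted sample of key values, pick evenly spaced boundary values, and assign each record to the partition whose range its key falls into.

    from bisect import bisect_right

    def build_range_map(sorted_sample_keys, num_partitions):
        # Choose num_partitions - 1 boundary values from the sorted sample.
        n = len(sorted_sample_keys)
        return [sorted_sample_keys[(i * n) // num_partitions] for i in range(1, num_partitions)]

    def range_partition(key, boundaries):
        return bisect_right(boundaries, key)    # partition index 0 .. num_partitions - 1

    sample = sorted(["Young", "Brown", "Avery", "Smith", "Harris", "Clark", "Perry", "Lee"])
    boundaries = build_range_map(sample, 4)     # ['Clark', 'Lee', 'Smith']
    print(range_partition("Mason", boundaries)) # 2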


When you run the job it will produce a totally sorted data set across the four partitions. The following shows the first 24 names in each of these partitions:

Partition 0    Partition 1    Partition 2    Partition 3

Must Dos
DataStage has many defaults which means that it can be very easy to include Sort stages in a job. This section specifies the minimum steps to take to get a Sort stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Sort stage: In the Stage Page Properties Tab, under the Sorting Keys category:

specify the key that you are sorting on. Repeat the property to specify a composite key.

In the Output Page Mapping Tab, specify how the output columns are derived.


Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Sorting Keys/Key | Input Column | N/A | Y | Y | N/A
Sorting Keys/Sort Order | Ascending/Descending | Ascending | Y | N | Key
Sorting Keys/Nulls position (only available for Sort Utility = DataStage) | First/Last | First | N | N | Key
Sorting Keys/Sort as EBCDIC | True/False | False | N | N | Key
Sorting Keys/Case Sensitive | True/False | True | N | N | Key
Sorting Keys/Sort Key Mode (only available for Sort Utility = DataStage) | Sort/Don't Sort (Previously Grouped)/Don't Sort (Previously Sorted) | Sort | Y | N | Key
Options/Sort Utility | DataStage/UNIX | DataStage | Y | N | N/A
Options/Stable Sort | True/False | True for Sort Utility = DataStage, False otherwise | Y | N | N/A
Options/Allow Duplicates (not available for Sort Utility = UNIX) | True/False | True | Y | N | N/A
Options/Output Statistics | True/False | False | Y | N | N/A
Options/Create Cluster Key Change Column (only available for Sort Utility = DataStage) | True/False | False | N | N | N/A
Options/Create Key Change Column | True/False | False | N | N | N/A
Options/Restrict Memory Usage | number MB | 20 | N | N | N/A
Options/Workspace | string | N/A | N | N | N/A
Sorting Keys Category


Key Specifies the key column for sorting. This property can be repeated to specify multiple key columns. You can use the Column Selection dialog box to select several keys at once if required (see page 3-10). Key has dependent properties depending on the Sort Utility chosen: Sort Order All sort types. Choose Ascending or Descending. The default is Ascending. Nulls position This property appears for sort type DataStage and is optional. By default columns containing null values appear first in the sorted data set. To override this default so that columns containing null values appear last in the sorted data set, select Last.


Sort as EBCDIC To sort as in the EBCDIC character set, choose True. Case Sensitive All sort types. This property is optional. Use this to specify whether each group key is case sensitive or not; this is set to True by default, i.e., the values CASE and case would not be judged equivalent. Sort Key Mode This property appears for sort type DataStage. It is set to Sort by default and this sorts on all the specified key columns. Set to Don't Sort (Previously Sorted) to specify that input records are already sorted by this column. The Sort stage will then sort on secondary key columns, if any. This option can increase the speed of the sort and reduce the amount of temporary disk space when your records are already sorted by the primary key column(s) because you only need to sort your data on the secondary key column(s). Set to Don't Sort (Previously Grouped) to specify that input records are already grouped by this column, but not sorted. The operator will then sort on any secondary key columns. This option is useful when your records are already grouped by the primary key column(s), but not necessarily sorted, and you want to sort your data only on the secondary key column(s) within each group.
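The saving offered by the Don't Sort settings can be sketched like this (an illustration, not the sort operator itself): when records already arrive ordered or grouped on the primary key, only each primary-key group needs sorting on the secondary key(s).

    from itertools import groupby

    def sort_within_groups(records, primary_key, secondary_keys):
        # Records are assumed to arrive already sorted (or grouped) on primary_key,
        # so only the much smaller groups are sorted on the secondary keys.
        out = []
        for _, group in groupby(records, key=lambda r: r[primary_key]):
            out.extend(sorted(group, key=lambda r: tuple(r[k] for k in secondary_keys)))
        return out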

Options Category
Sort Utility The type of sort the stage will carry out. Choose from: DataStage. The default. This uses the built-in DataStage sorter, you do not require any additional software to use this option. UNIX. This specifies that the UNIX sort command is used to perform the sort. Stable Sort Applies to a Sort Utility type of DataStage, the default is True. It is set to True to guarantee that this sort operation will not rearrange records that are already in a properly sorted data set. If set to False no prior ordering of records is guaranteed to be preserved by the sorting operation.


Allow Duplicates Set to True by default. If False, specifies that, if multiple records have identical sorting key values, only one record is retained. If Stable Sort is True, then the first record is retained. This property is not available for the UNIX sort type. Output Statistics Set False by default. If True it causes the sort operation to output statistics. This property is not available for the UNIX sort type. Create Cluster Key Change Column This property appears for sort type DataStage and is optional. It is set False by default. If set True it tells the Sort stage to create the column clusterKeyChange in each output record. The clusterKeyChange column is set to 1 for the first record in each group where groups are defined by using a Sort Key Mode of Dont Sort (Previously Sorted) or Dont Sort (Previously Grouped). Subsequent records in the group have the clusterKeyChange column set to 0. Create Key Change Column This property appears for sort type DataStage and is optional. It is set False by default. If set True it tells the Sort stage to create the column KeyChange in each output record. The KeyChange column is set to 1 for the first record in each group where the value of the sort key changes. Subsequent records in the group have the KeyChange column set to 0. Restrict Memory Usage This is set to 20 by default. It causes the Sort stage to restrict itself to the specified number of megabytes of virtual memory on a processing node. We recommend that the number of megabytes specified is smaller than the amount of physical memory on a processing node. Workspace This property appears for sort type UNIX only. Optionally specifies the workspace used by the stage.
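The key change column behaviour described above can be sketched as follows (illustrative only; the column name KeyChange follows the description above, and the helper name is hypothetical):

    def add_key_change(records, key_cols, column_name="KeyChange"):
        # Mark the first record of each new sort-key value with 1, the rest with 0.
        previous = object()                    # sentinel that never equals a real key
        for rec in records:
            key = tuple(rec[c] for c in key_cols)
            rec[column_name] = 1 if key != previous else 0
            previous = key
        return records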


Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Preserve partitioning. This is Set by default. You can explicitly select Set or Clear. Select Set to request the next stage in the job should attempt to maintain the partitioning. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

NLS Locale Tab


This appears if you have NLS enabled on your system. If you are using the DataStage sort type, it lets you view the current default collate convention, and select a different one for this stage if required (for UNIX sorts, it is blank). You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Sort stage uses this when it is determining the order of

the sorted fields. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the data coming in to be sorted. The Sort stage can have only one input link. The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Sort stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the sort is performed. By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job the stage will warn you when the job runs if it cannot preserve the partitioning of the incoming data.


If the Sort Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Sort stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Sort stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Sort stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Sort stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button .


The following Collection methods are available: (Auto). This is the default collection method for the Sort stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the Sort is performed. This is a standard feature of the stage editors, if you make use of it you will be running a simple sort before the main Sort operation that the stage provides. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.


Outputs Page
The Outputs page allows you to specify details about data output from the Sort stage. The Sort stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Sort stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about Sort stage mapping is given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For Sort stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them.

The left pane shows the columns of the sorted data. These are read only and cannot be modified on this tab. This shows the meta data from the input link. The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Automatch facility. In the above example the left pane represents the incoming data after the sort has been performed. The right pane represents the data being output by the stage after the sort operation. In this example the data has been mapped straight across.


24
Remove Duplicates Stage
The Remove Duplicates stage is a processing stage. It can have a single input link and a single output link. The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.

Removing duplicate records is a common way of cleansing a data set before you perform further processing. Two rows are considered duplicates if they are adjacent in the input data set and have identical values for the key column(s). A key column is any column you designate to be used in determining whether two rows are identical. The data set input to the Remove Duplicates stage must be sorted so that all records with identical key values are adjacent. You can either achieve this using the in-stage sort facilities available on the Inputs page Partitioning tab, or have an explicit Sort stage feeding the Remove Duplicates stage. The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify details about the data set that is having its duplicates removed.

Output Page. This is where you specify details about the processed data that is being output from the stage.

Example
In our example our data is a list of people who were allocated land in the village of Stewkley, Buckinghamshire by the 1812 enclosure award. The data contains some duplicate entries, and we want to remove these. Here is a sample of the input data:

Here is the job that will remove the duplicates:

The first step is to sort the data so that the duplicates are actually next to each other. As with all sorting operations, there are implications around data partitions if you run the job in parallel (see Chapter 23, "Sort Stage," for a discussion of these). You should hash partition the data using the sort keys as hash keys in order to guarantee that duplicate rows are in the same partition. In our example we sort on

the Firstname and Lastname columns, and our sample of the sorted data shows up some duplicates:

Next, we set up the Remove Duplicates stage to remove rows that share the same values in the Firstname and Lastname columns. The stage will retain the first of the duplicate records:


Here is a sample of the data after the job has been run and the duplicates removed:

Must Dos
DataStage has many defaults which means that it can be very easy to include Remove Duplicates stages in a job. This section specifies the minimum steps to take to get a Remove Duplicates stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Remove Duplicates stage: In the Stage Page Properties Tab select the key column. Identical values in this column will be taken to denote duplicate rows, which the stage will remove. Repeat the property to specify a composite key. In the Outputs Page Mapping Tab, specify how output columns are derived.


Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Keys that Define Duplicates/Key | Input Column | N/A | Y | Y | N/A
Keys that Define Duplicates/Sort as EBCDIC | True/False | False | N | N | Key
Keys that Define Duplicates/Case Sensitive | True/False | True | N | N | Key
Options/Duplicate to retain | First/Last | First | Y | N | N/A

Keys that Define Duplicates Category


Key Specifies the key column for the operation. This property can be repeated to specify multiple key columns. You can use the Column Selection dialog box to select several keys at once if required (see page 3-10). Key has dependent properties as follows: Sort as EBCDIC To sort as in the EBCDIC character set, choose True.


Case Sensitive Use this to specify whether each key is case sensitive or not; this is set to True by default, i.e., the values CASE and case would not be judged equivalent.

Options Category
Duplicate to retain Specifies which of the duplicate records encountered to retain. Choose between First and Last. It is set to First by default.
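A sketch of the overall behaviour on sorted input (illustrative only, not the actual operator): rows are duplicates when they are adjacent and equal on the key columns, and either the first or the last row of each run is kept, as controlled by this property.

    from itertools import groupby

    def remove_duplicates(sorted_records, key_cols, retain="First"):
        keyfunc = lambda r: tuple(r[c] for c in key_cols)
        out = []
        for _, group in groupby(sorted_records, key=keyfunc):
            rows = list(group)
            out.append(rows[0] if retain == "First" else rows[-1])
        return out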

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Remove Duplicates stage uses this when it is determining the sort order for the key column(s). Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the data coming in to have its duplicates removed. Choose an input link from the Input name drop down list to specify which link you want to work on. The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Remove Duplicates stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Partitioning on Input Links


The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the operation is performed. By default the stage uses the auto partitioning method. If the Remove Duplicates stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Remove Duplicates stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Remove Duplicates stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Remove Duplicates stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Remove Duplicates stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place.


DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for the Remove Duplicates stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the remove duplicates operation is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for
each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Output Page
The Outputs page allows you to specify details about data output from the Remove Duplicates stage. The stage only has one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Remove Duplicates stage and the output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about Remove Duplicates stage mapping is given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For Remove Duplicates stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them.

The left pane shows the columns of the input data. These are read only and cannot be modified on this tab. This shows the meta data from the incoming link.

The right pane shows the output columns for the master output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. In the above example the left pane represents the incoming data after the remove duplicates operation has been performed. The right pane represents the data being output by the stage after the remove duplicates operation. In this example the data has been mapped straight across.


25
Compress Stage
The Compress stage is a processing stage. It can have a single input link and a single output link. The Compress stage uses the UNIX compress or GZIP utility to compress a data set. It converts a data set from a sequence of records into a stream of raw binary data. The complement to the Compress stage is the Expand stage, which is described in Chapter 26. A compressed data set is similar to an ordinary data set and can be stored in a persistent form by a Data Set stage. However, a compressed data set cannot be processed by many stages until it is expanded, that is, until its rows are returned to their normal format. Stages that do not perform column-based processing or reorder the rows can operate on compressed data sets. For example, you can use the copy stage to create a copy of the compressed data set. Because compressing a data set removes its normal record boundaries, the compressed data set must not be repartitioned before it is expanded.
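Conceptually the effect is similar to the following Python sketch (an analogy only; the stage itself drives the UNIX compress or gzip utility): once the rows are joined into one compressed byte stream, the record boundaries are gone until the stream is expanded again.

    import gzip

    records = [b"John,Smith,1627\n", b"Mary,Jones,1627\n"]   # hypothetical rows
    compressed = gzip.compress(b"".join(records))            # one opaque binary stream
    restored = gzip.decompress(compressed).splitlines()      # boundaries return only on expand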

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage.

Input Page. This is where you specify details about the data set being compressed. Output Page. This is where you specify details about the compressed data being output from the stage.

Must Dos
DataStage has many defaults which means that it can be very easy to include Compress stages in a job. This section specifies the minimum steps to take to get a Compress stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Compress stage: In the Stage Page Properties Tab choose the compress command to use. Compress is the default but you can also choose gzip. Ensure column meta data is defined for both the input and output link.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. The stage only has a single property, which determines whether the stage uses compress or GZIP.

Category/Property: Options/Command. Values: compress/gzip. Default: compress. Mandatory?: Y. Repeats?: N. Dependent of: N/A.

Options Category
Command Specifies whether the stage will use compress (the default) or GZIP.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Set by default. You can explicitly select Set or Clear. Select Set to request the next stage should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Input Page
The Inputs page allows you to specify details about the data set being compressed. There is only one input link. The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies

the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Compress stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning on Input Links


The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the compress is performed. By default the stage uses the auto partitioning method. If the Compress stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Compress stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Compress stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Compress stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Compress stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for the Compress stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the compression is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default.

Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Output Page
The Outputs page allows you to specify details about data output from the Compress stage. The stage only has one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of the tabs.


26
Expand Stage
The Expand stage is a processing stage. It can have a single input link and a single output link. The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set. It converts a previously compressed data set back into a sequence of records from a stream of raw binary data. The complement to the Expand stage is the Compress stage which is described in Chapter 25.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Input Page. This is where you specify details about the data set being expanded. Output Page. This is where you specify details about the expanded data being output from the stage.

Must Dos
DataStage has many defaults which means that it can be very easy to include Expand stages in a job. This section specifies the minimum steps to take to get an Expand stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use an Expand stage: In the Stage Page Properties Tab choose the uncompress command to use. This is uncompress by default but you can also choose gzip. Ensure column meta data is defined for both the input and output link.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. The stage only has a single property, which determines whether the stage uses uncompress or GZIP.

Category/Property: Options/Command. Values: uncompress/gzip. Default: uncompress. Mandatory?: Y. Repeats?: N. Dependent of: N/A.

Options Category
Command Specifies whether the stage will use uncompress (the default) or GZIP.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. The stage has a mandatory partitioning method of Same, this overrides the preserve partitioning flag and so the partitioning of the incoming data is always preserved. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Input Page
The Inputs page allows you to specify details about the data set being expanded. There is only one input link. The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Expand stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning on Input Links


The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the expansion is performed. By default the stage uses the Same partitioning method and this cannot be altered. This preserves the partitioning already in place. If the Expand stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following Collection methods are available: (Auto). This is the default collection method for the Expand stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab normally also allows you to specify that data arriving on the input link should be sorted before the expansion is performed. This facility is not available on the expand stage.

Output Page
The Outputs page allows you to specify details about data output from the Expand stage. The stage only has one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of the tabs.

27
Copy Stage
The Copy stage is a processing stage. It can have a single input link and any number of output links. The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns (to copy with more extensive modification, for example changing column data types, use the Modify stage described in Chapter 28). Copy lets you make a backup copy of a data set on disk while performing an operation on another copy, for example.

Where you are using a Copy stage with a single input and a single output, you should ensure that you set the Force property in the stage editor to True. This prevents DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job. The stage editor has three pages:

Stage Page. This is always present and is used to specify general information about the stage. Input Page. This is where you specify details about the input link carrying the data to be copied. Outputs Page. This is where you specify details about the copied data being output from the stage.

Example
In this example we are going to copy data about the people who were allocated land in the village of Stewkley, Buckinghamshire by the 1812 enclosure award. We are going to copy it to three separate data sets, and in each case we are only copying a subset of the columns. The Copy stage will drop the unwanted columns as it copies the data set. The column definitions for the input data set are as follows:

Here is the job that will perform the copying:

The Copy stage properties are fairly simple. The only property is Force, and we do not need to set it in this instance as we are copying to multiple data sets (and DataStage will not attempt to optimize it out of the job). We need to concentrate on telling DataStage which columns to drop on each output link. The easiest way to do this is using the Outputs page Mapping tab. When you open this for a link the left pane shows the input columns; simply drag the columns you

want to preserve across to the right pane. We repeat this for each link as follows:

When the job is run, three copies of the original data set are produced, each containing a subset of the original columns, but all of the rows. Here is some sample data from the data set on DSLink6, which gives us the name of each landholder and the amount of land they were allocated, both in the old measure of acres, roods and perches and as a decimal acreage:

Must Dos
DataStage has many defaults which means that it can be very easy to include Copy stages in a job. This section specifies the minimum steps to take to get a Copy stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Copy stage: Ensure that meta data has been defined for the input link and output links. In the Outputs Page Mapping Tab, specify how the input columns of the data set being copied map onto the columns of the various output links.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. The Copy stage only has one property.
Category/Property: Options/Force. Values: True/False. Default: False. Mandatory?: N. Repeats?: N. Dependent of: N/A.

Options Category
Force Set True to specify that DataStage should not try to optimize the job by removing a Copy operation where there is one input and one output. Set False by default.

Advanced Tab
This tab allows you to specify the following:

Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage.You can explicitly select Set or Clear. Select Set to request the stage should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Input Page
The Inputs page allows you to specify details about the data set being copied. There is only one input link. The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Copy stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning on Input Links


The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the copy is performed. By default the stage uses the auto partitioning method. If the Copy stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Copy stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Copy stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Copy stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Copy stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place.

DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button. Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button. The following Collection methods are available: (Auto). This is the default collection method for the Copy stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the copy is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Copy stage. The stage can have any number of output links; choose the one you want to work on from the Output name drop down list. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Copy stage and the output columns. The Advanced tab allows you to change the default buffering settings for the output links. Details about Copy stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For Copy stages the Mapping tab allows you to specify how the output columns are derived, i.e., what copied columns map onto them.

The left pane shows the copied columns. These are read only and cannot be modified on this tab. The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging copied columns over, or by using the Auto-match facility. In the above example the left pane represents the incoming data after the copy has been performed. The right pane represents the data being output by the stage after the copy operation. In this example the data has been mapped straight across.

28
Modify Stage
The Modify stage is a processing stage. It can have a single input link and a single output link. The modify stage alters the record schema of its input data set. The modified data set is then output. You can drop or keep columns from the schema, or change the type of a column.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Input Page. This is where you specify details about the input link. Outputs Page. This is where you specify details about the modified data being output from the stage.

Examples
Dropping and Keeping Columns
The following example takes a data set comprising the following columns:

The modify stage is used to drop the REPID, CREDITLIMIT, and COMMENTS columns. To do this, the stage properties are set as follows:

The easiest way to specify the outgoing meta data in this example would be to use runtime column propagation. You could, however,

choose to specify the meta data manually, in which case it would look like:

You could achieve the same effect by specifying which columns to keep, rather than which ones to drop. In the case of this example the required specification to use in the stage properties would be:
KEEP CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE

Changing Data Type


You could also change the data types of one or more of the columns from the above example. Say you wanted to convert the CUSTID from decimal to string, you would specify a new column to take the converted data, and specify the conversion in the stage properties:

Some data type conversions require you to use a transform command; a list of these, and the available transforms, is given in "Specification" on page 28-5. The decimal to string conversion is one that can be performed using an explicit transform. In this case, the specification on the Properties page is as follows:
conv_CUSTID:string = string_from_decimal(CUSTID)

Null Handling
You can also use the Modify stage to handle columns that might contain null values. Any of the columns in the example, other than CUSTID, could legally contain a null value. You could use the modify stage to detect when the PHONE column contains a null value, and substitute the string NULL. In this case, the specification on the Properties page would be:
PHONE:string = NullToValue (PHONE,NULL)

Other null handling transforms are described in "Specification" on page 28-5.

Must Dos
DataStage has many defaults which means that it can be very easy to include Modify stages in a job. This section specifies the minimum steps to take to get a Modify stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a

particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Modify stage: In the Stage Page Properties Tab, supply the Modify specification. Ensure you have specified the meta data for the input and output columns.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. The modify stage only has one property, although you can repeat this as required.
Category/Property: Options/Specification. Values: string. Default: N/A. Mandatory?: Y. Repeats?: Y. Dependent of: N/A.

Options Category
Specification This is a statement with one of the following forms:
DROP columnname [, columnname]
KEEP columnname [, columnname]
new_columnname [:new_type] = [explicit_conversion_function] old_columnname
If you choose to drop a column or columns, all columns are retained except those you explicitly drop. If you choose to keep a column or columns, all columns are excluded except those you explicitly keep. If you specify multiple specifications each will be carried out sequentially.
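For example, a stage might combine a drop specification with a type conversion by entering two Specification values (the statements below reuse column names from the example earlier in this chapter and are purely illustrative):

DROP REPID, COMMENTS
conv_CUSTID:string = string_from_decimal(CUSTID)

Because the Specification property repeats, each statement is entered as a separate value and the statements are applied in the order given.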


Some type conversions DataStage can carry out automatically; others need you to specify an explicit conversion function, and some conversions are not available. This applies to conversions between the types int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, raw, date, time, and timestamp. Where a conversion cannot be carried out automatically, the explicit conversion function to use is given in the list of conversion functions below.

For a default type conversion, your specification would take the following form:
new_columnname = old_columnname

For example:
int8col = uint64col

Where a manual conversion is required, your specification takes the form:


new_columnname:new_type = conversion_function (old_columnname)

For example:
day_column:int8 = month_day_from_date (date_column)

The new_type can be any of the destination types that are supported for conversions from the source (that is, any destination type for which a manual conversion from the source type is available). For example, you can use the conversion hours_from_time to convert a time to an int8, or to an int16, int32, dfloat, and so on. DataStage warns you when it is performing an implicit data type conversion, for example hours_from_time expects to convert a time to an int8, and will warn you if converting to an int16, int32, or dfloat. The following list gives the available conversion functions in the form conversion_name [optional_arguments] (source_type). The destination type can be any of the supported types as described above.
date_from_days_since (int32, date): Converts an integer field into a date by adding the integer to the specified base date. The date must be in the format yyyy-mm-dd and must be either double quoted or a variable.
date_from_julian_day (uint32): Date from Julian day.
date_from_string (string) [date_format]: Converts the string to a date representation using the specified date_format. By default the string format is yyyy-mm-dd. See page B-13 for an explanation of date_format.
date_from_timestamp [date_format] (timestamp): Converts the timestamp to a date representation. By default the string format is yyyy-mm-dd.
date_from_ustring (ustring) [date_format]: Converts the string to a date representation using the specified date_format.


days_since_from_date [source_date] (date): Returns a value corresponding to the number of days from source_date to the date in the source column. source_date must be in the form yyyy-mm-dd and can be quoted or unquoted.
decimal_from_decimal [r_type] (decimal): Decimal from decimal. See page B-12 for an explanation of r_type.
decimal_from_dfloat [r_type] (dfloat): Decimal from dfloat. See page B-12 for an explanation of r_type.
decimal_from_string [r_type] (string): Decimal from string. See page B-12 for an explanation of r_type.
decimal_from_ustring [r_type] (ustring): Decimal from ustring. See page B-12 for an explanation of r_type.
dfloat_from_decimal [fix_zero] (decimal): Dfloat from decimal. See page B-13 for an explanation of fix_zero.
hours_from_time (time): Hours from time.
int32_from_decimal [r_type, fix_zero] (decimal): Int32 from decimal. See page B-12 for an explanation of r_type and page B-13 for an explanation of fix_zero.
int64_from_decimal [r_type, fix_zero] (decimal): Int64 from decimal. See page B-12 for an explanation of r_type and page B-13 for an explanation of fix_zero.
julian_day_from_date (date): Julian day from date.
lookup_string_from_int16 [tableDefinition] (int16): Converts numeric values to strings by means of a lookup table. See page 28-13 for an explanation of tableDefinition.
lookup_ustring_from_int16 [tableDefinition] (int16): Converts numeric values to ustrings by means of a lookup table. See page 28-13 for an explanation of tableDefinition.
lookup_ustring_from_int32 [tableDefinition] (int32): Converts numeric values to ustrings by means of a lookup table. See page 28-13 for an explanation of tableDefinition.
lookup_string_from_uint32 [tableDefinition] (uint32): Converts numeric values to strings by means of a lookup table. See page 28-13 for an explanation of tableDefinition.
lookup_int16_from_string [tableDefinition] (string): Converts strings to numeric values by means of a lookup table. See page 28-13 for an explanation of tableDefinition.


lookup_int16_from_ustring [tableDefinition] (ustring): Converts ustrings to numeric values by means of a lookup table. See page 28-13 for an explanation of tableDefinition.
lookup_uint32_from_string [tableDefinition] (string): Converts strings to numeric values by means of a lookup table. See page 28-13 for an explanation of tableDefinition.
lookup_uint32_from_ustring [tableDefinition] (ustring): Converts ustrings to numeric values by means of a lookup table. See page 28-13 for an explanation of tableDefinition.
lowercase_string (string): Convert strings to all lower case. Non-alphabetic characters are ignored in the conversion.
lowercase_ustring (ustring): Convert ustrings to all lower case. Non-alphabetic characters are ignored in the conversion.
mantissa_from_decimal (decimal): Returns the mantissa from the given decimal.
mantissa_from_dfloat (dfloat): Returns the mantissa from the given dfloat.
microseconds_from_time (time): Microseconds from time.
midnight_seconds_from_time (time): Seconds-from-midnight from time.
minutes_from_time (time): Minutes from time.
month_day_from_date (date): Day of month from date.
month_from_date (date): Month from date.
next_weekday_from_date [day] (date): The destination contains the date of the specified day of the week soonest after the source date (including the source date). day is a string specifying a day of the week. You can specify day by either the first three characters of the day name or the full day name. The day can be quoted in either single or double quotes or quotes can be omitted.
notnull (any): Returns true when an expression does not evaluate to the null value.
null (any): Returns true when an expression does evaluate to the null value.


previous_weekday_from_date [day] (date): The destination contains the closest date for the specified day of the week earlier than the source date (including the source date). The day is a string specifying a day of the week. You can specify day by either the first three characters of the day name or the full day name. The day can be quoted in either single or double quotes or quotes can be omitted.
raw_from_string (string): Returns a string in raw representation.
raw_length (raw): Returns the length of a raw.
seconds_from_time (time): Seconds from time.
seconds_since_from_timestamp [timestamp] (timestamp): Seconds since the time given by timestamp.
string_from_date [date_format] (date): Converts the date to a string representation using the specified date_format. By default, the string format is yyyy-mm-dd. See page B-13 for an explanation of date_format.
string_from_decimal [fix_zero] (decimal): String from decimal. See page B-13 for an explanation of fix_zero.
string_from_time [time_format] (time): Converts the time to a string representation using the specified time_format. By default, the string format is %yyyy-%mm%dd hh:nn:ss. See page B-13 for an explanation of time_format.
string_from_timestamp [timestamp_format] (timestamp): Converts the timestamp to a string representation using the specified timestamp_format. By default, the string format is %yyyy-%mm%dd hh:nn:ss. See page B-13 for an explanation of time_format.
string_from_ustring (ustring): Returns a string from a ustring.
string_length (string): Returns an int32 containing the length of a string.


substring [startPosition,len] (string): Converts long strings to shorter strings by string extraction. The startPosition specifies the starting location of the substring; len specifies the substring length. If startPosition is positive, it specifies the byte offset into the string from the beginning of the string. If startPosition is negative, it specifies the byte offset from the end of the string.
time_from_midnight_seconds (dfloat): Time from seconds-from-midnight.
time_from_string [time_format] (string): Converts the string to a time representation using the specified time_format. By default, the string format is %yyyy-%mm%dd hh:nn:ss. See page B-13 for an explanation of time_format.
time_from_timestamp (timestamp): Time from timestamp.
time_from_ustring (ustring): Returns a time from a ustring.
timestamp_from_date [time] (date): Timestamp from date. The time argument optionally specifies the time to be used in building the timestamp result and must be in the form hh:nn:ss. If omitted, the time defaults to midnight.
timestamp_from_seconds_since [timestamp] (dfloat): Timestamp from a seconds since value.
timestamp_from_string [timestamp_format] (string): Converts the string to a timestamp representation using the specified timestamp_format. By default, the string format is %yyyy-%mm%dd hh:nn:ss. See page B-13 for an explanation of time_format.
timestamp_from_time [date] (time): Timestamp from time. The date argument is required. It specifies the date portion of the timestamp and must be in the form yyyy-mm-dd.
timestamp_from_timet (int32): Timestamp from time_t. The source field must contain a timestamp as defined by the UNIX time_t representation.


timestamp_from_ustring (ustring): Returns a timestamp from a ustring.
timet_from_timestamp (timestamp): Time_t from timestamp. The destination column contains a timestamp as defined by the UNIX time_t representation.
uint64_from_decimal [r_type, fix_zero] (decimal): Uint64 from decimal. See page B-12 for an explanation of r_type and page B-13 for an explanation of fix_zero.
uppercase_string (string): Convert strings to all upper case. Non-alphabetic characters are ignored in the conversion.
uppercase_ustring (ustring): Convert ustrings to all upper case. Non-alphabetic characters are ignored in the conversion.
u_raw_from_string (ustring): Returns a raw from a ustring.
ustring_from_date (date): Returns a ustring from a date.
ustring_from_decimal (decimal): Returns a ustring from a decimal.
ustring_from_string (string): Returns a ustring from a string.
ustring_from_time (time): Returns a ustring from a time.
ustring_from_timestamp (timestamp): Returns a ustring from a timestamp.
ustring_length (ustring): Returns the length of a ustring.
u_substring (ustring): Converts long ustrings to shorter ustrings by string extraction.
weekday_from_date [originDay] (date): Day of week from date. originDay is a string specifying the day considered to be day zero of the week. You can specify the day using either the first three characters of the day name or the full day name. If omitted, Sunday is defined as day zero. The originDay can be either single- or double-quoted or the quotes can be omitted.
year_day_from_date (date): Day of year from date (returned value 1-366).
year_from_date (date): Year from date.


year_week_from_date (date): Week of year from date.

tableDefinition defines the rows of a string lookup table and has the following form:
{propertyList} ('string' = value; 'string' = value; ... )

where: propertyList is one or more of the following options; the entire list is enclosed in braces and properties are separated by commas if there are more than one:

case_sensitive. Perform a case-sensitive search for matching strings; the default is case-insensitive. default_value = defVal. The default numeric value returned for a string that does not match any of the strings in the table. default_string = defString. The default string returned for numeric values that do not match any numeric value in the table.

string specifies a comma-separated list of strings associated with value; enclose each string in quotes. value specifies a comma-separated list of 16-bit integer values associated with string.
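As an illustration of this syntax, a lookup conversion might be written as follows (the column names conv_STATUS and STATUS, and the values in the table, are hypothetical):

conv_STATUS:int16 = lookup_int16_from_string[{default_value = 0}('Y' = 1; 'N' = 2)](STATUS)

Here any STATUS value other than 'Y' or 'N' is converted to the default numeric value 0.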

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. If you have an input data set, it adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request the next stage should attempt to maintain the partitioning.


Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Input Page
The Inputs page allows you to specify details about the incoming data set you are modifying. There is only one input link. The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Modify stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning on Input Links


The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the modify is performed. By default the stage uses the auto partitioning method. If the Modify stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Modify stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode.


If the Modify stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Modify stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Modify stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for the Modify stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operation starts over.


Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the modify operation is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
See Chapter 3, "Stage Editors," for a general description of the output tabs.


29
Filter Stage
The Filter stage is a processing stage. It can have a single input link, any number of output links and, optionally, a single reject link. The Filter stage transfers, unmodified, the records of the input data set which satisfy the specified requirements and filters out all other records. You can specify different requirements to route rows down different output links. The filtered out records can be routed to a reject link, if required.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Input Page. This is where you specify details about the input link carrying the data to be filtered. Outputs Page. This is where you specify details about the filtered data being output from the stage down the various output links.

Specifying the Filter


The operation of the filter stage is governed by the expressions you set in the Where property on the Properties Tab. You can use the following elements to specify the expressions: Input columns. Requirements involving the contents of the input columns. Optional constants to be used in comparisons. The Boolean operators AND and OR to combine requirements. When a record meets the requirements, it is written unchanged to the specified output link. The Where property supports standard SQL expressions, except when comparing strings. When quoting in the filter, you should use single, not double, inverted commas.

Input Data Columns


If you specify a single column for evaluation, that column can be of any data type. Note that DataStage's treatment of strings differs slightly from that of standard SQL. If you compare columns they must be of the same or compatible data types. Otherwise, the operation terminates with an error. Compatible data types are those that DataStage converts by default. Regardless of any conversions the whole row is transferred unchanged to the output. If the columns are not compatible upstream of the filter stage, you can convert the types by using a Modify stage prior to the Filter stage. Column data type conversion is based on the following rules: Any integer, signed or unsigned, when compared to a floating-point type, is converted to floating-point. Comparisons within a general type convert the smaller to the larger size (sfloat to dfloat, uint8 to uint16, etc.) When signed and unsigned integers are compared, unsigned are converted to signed. Decimal, raw, string, time, date, and timestamp do not figure in type conversions. When any of these is compared to another type, filter returns an error and terminates. The input field can contain nulls. If it does, null values are less than all non-null values, unless you specify the operator's nulls last option.


Note The conversion of numeric data types may result in a loss of range and cause incorrect results. DataStage displays a warning message to that effect when range is lost.

Supported Boolean Expressions and Operators


The following list summarizes the Boolean expressions that are supported. In the list, BOOLEAN denotes any Boolean expression.
true
false
six comparison operators: =, <>, <, >, <=, >=
is null
is not null
like 'abc' (the second operand must be a regular expression)
between (for example, A between B and C is equivalent to B <= A and A <= C)
not BOOLEAN
BOOLEAN is true
BOOLEAN is false
BOOLEAN is not true
BOOLEAN is not false
Any of these can be combined using AND or OR.
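For example, several of these expressions can be combined in a single Where clause (the column names amount and region are illustrative):

amount between 10 and 100 and region is not null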

Order of Association
As in SQL, expressions are associated left to right. AND and OR have the same precedence. You may group fields and expressions in parentheses to affect the order of evaluation.

String Comparison
DataStage sorts string values according to these general rules: Characters are sorted in lexicographic order. Strings are evaluated by their ASCII value. Sorting is case sensitive, that is, uppercase letters appear before lowercase letters in sorted data. Null characters appear before non-null characters in a sorted data set, unless you specify the nulls last option.


Byte-for-byte comparison is performed.

Examples
The following give some example Where properties.

Comparing Two Fields


You want to compare columns number1 and number2. If the data in column number1 is greater than the data in column number2, the corresponding records are to be written to output link 2. You enter the following in the Where property:
number1 > number2

You then select output link 2 in the dependent Output Link property. (You use the Link Ordering tab to specify the number order of the output links).

Testing for a Null


You want to test column serialno to see if it contains a null. If it does, you want to write the corresponding records to the output link. You enter the following in the Where property:
serialno is null

In this example the stage only has one output link. You do not need to specify the Output Link property because the stage will write to the output link by default.
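Conversely, to route only the rows in which serialno does hold a value, you could use the complementary test (an illustrative variation on the example above):
serialno is not null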

Evaluating Input Columns


You want to evaluate each input row to see if these conditions prevail: EITHER all the following are true

Column number1 does not have the value 0 Column number2 does not have the value 3 Column number3 has the value 0

OR column name equals the string ZAG You enter the following in the Where property:
number1 <> 0 and number2 <> 3 and number3 = 0 or name = 'ZAG'

If these conditions are met, the stage writes the row to the output link.
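Because AND and OR have the same precedence and associate left to right, the clause above is evaluated with the OR applying to the whole of the preceding expression. You could use parentheses to change this; for example, the following illustrative variation requires the number1 and number2 conditions to hold in every case, together with either number3 = 0 or name = 'ZAG':
number1 <> 0 and number2 <> 3 and (number3 = 0 or name = 'ZAG')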


Must Dos
DataStage has many defaults which means that it can be very easy to include Filter stages in a job. This section specifies the minimum steps to take to get a Filter stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you get familiar with the product. To use a Filter stage: In the Stage Page Properties Tab:

Supply the specifications that determine which records are accepted and which are filtered out. This is given in the form of a Where clause. You can specify multiple statements, each applying to a different link. Specify which Where clauses correspond to which output links (an illustrative set of property values follows these steps). Specify whether rows that fail to satisfy any of the Where clauses will be routed to a reject link. Specify whether rows are output only for the first Where clause they satisfy, or for any clauses they satisfy.

In the Stage Page Link Ordering Tab, specify the order in which the output links are processed. This is important where you specify that rows are only output for the first Where clause that they satisfy. Ensure that meta data has been defined for the input link, the output links, and the reject link, if applicable. In the Outputs Page Mapping Tab, specify how the input columns of the data set being filtered map onto the columns of the various output links.
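As an illustration only (the column name, values, and link assignments here are hypothetical), a Filter stage that splits rows on an amount column might be configured with settings along these lines:
Where clause = amount < 100, Output link = first output link
Where clause = amount < 1000, Output link = second output link
Output rows only once = True, so a row with an amount of 50 goes down the first link only
Output rejects = True, so rows satisfying neither clause (amount of 1000 or more) are routed to the reject link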

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify what order the output links are processed in. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.


Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Predicates/Where clause: Values = string; Default = N/A; Mandatory = Y; Repeats = Y; Dependent of = N/A
Predicates/Output link: Values = Output link; Default = N/A; Mandatory = Y; Repeats = N; Dependent of = Where clause
Options/Output rejects: Values = True/False; Default = False; Mandatory = Y; Repeats = N; Dependent of = N/A
Options/Output rows only once: Values = True/False; Default = False; Mandatory = Y; Repeats = N; Dependent of = N/A
Options/Nulls value: Values = Less Than/Greater Than; Default = Less Than; Mandatory = N; Repeats = N; Dependent of = N/A

Predicates Category
Where clause Specify a Where statement that a row must satisfy in order to be routed down this link. This is like an SQL Where clause, see "Specifying the Filter" on page 29-2 for details. Output link Specify the output link corresponding to the Where clause.

Options Category
Output rejects Set this to true to output rows that satisfy no Where clauses down the reject link (remember to specify which link is the reject link on the parallel job canvas).


Output rows only once Set this to true to specify that rows are only output down the link of the first Where clause they satisfy. Set to false to have rows output down the links of all Where clauses that they satisfy. Nulls value Specify whether null values are treated as greater than or less than other values.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage. You can explicitly select Set or Clear. Select Set to request that the stage should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering Tab


This tab allows you to specify the order in which output links are processed. This is important where you have set the Output rows only once property to True.

NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Filter stage uses this when evaluating Where clauses. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Input Page
The Inputs page allows you to specify details about the data set being filtered. There is only one input link. The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Filter stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning on Input Links


The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the filter is performed. By default the stage uses the auto partitioning method. If the Filter stage is operating in sequential mode, it will first collect the data using the default auto collection method before the filter is performed.


The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Filter stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Filter stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Filter stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Filter stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available:


(Auto). This is the default collection method for the Filter stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the filter is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.


Outputs Page
The Outputs page allows you to specify details about data output from the Filter stage. The stage can have any number of output links, plus one reject link; choose the one you want to work on from the Output name drop down list. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Filter stage and the output columns. The Advanced tab allows you to change the default buffering settings for the output links. Details about Filter stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For Filter stages the Mapping tab allows you to specify how the output columns are derived, i.e., what filtered columns map onto them.

The left pane shows the filtered columns. These are read only and cannot be modified on this tab. The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.


In the above example the left pane represents the incoming data after the filter has been performed. The right pane represents the data being output by the stage after the mapping has been performed. In this example the data has been mapped straight across.


30
External Filter Stage
The External Filter stage is a processing stage. It can have a single input link and a single output link. The External Filter stage allows you to specify a UNIX command that acts as a filter on the data you are processing. An example would be to use the stage to grep a data set for a certain string, or pattern, and discard records which did not contain a match. This can be a quick and efficient way of filtering data.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Input Page. This is where you specify details about the input link carrying the data to be filtered. Outputs Page. This is where you specify details about the filtered data being output from the stage.


Must Dos
DataStage has many defaults which means that it can be very easy to include External Filter stages in a job. This section specifies the minimum steps to take to get an External Filter stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you get familiar with the product. To use an External Filter stage: In the Stage Page Properties Tab specify the filter command the stage will use. Optionally add arguments that the command requires.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Options/Filter Command: Values = string; Default = N/A; Mandatory = Y; Repeats = N; Dependent of = N/A
Options/Arguments: Values = string; Default = N/A; Mandatory = N; Repeats = N; Dependent of = N/A

Options Category
Filter Command Specifies the filter command line to be executed and any command line options it requires. For example:
grep


Arguments Allows you to specify any arguments that the command line requires. For example:
\(cancel\).*\1

Together with the grep command, this would extract all records that contain the string cancel twice and discard other records.
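To make the effect concrete, the combination above behaves broadly as if each record were passed through a command line such as the following (shown for illustration; it is not necessarily the exact invocation DataStage constructs):
grep '\(cancel\).*\1'
Records that produce a match are passed on to the output link; all other records are discarded.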

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Input Page
The Inputs page allows you to specify details about the data set being filtered. There is only one input link.


The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about External Filter stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning on Input Links


The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the filter is executed. By default the stage uses the auto partitioning method. If the External Filter stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the External Filter stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the External Filter stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type dropdown list. This will override any current partitioning. If the External Filter stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the External Filter stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.


Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button. Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button. The following Collection methods are available: (Auto). This is the default collection method for the External Filter stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the filter command is executed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default.


Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the External Filter stage. The stage can only have one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of these tabs.


31
Change Capture Stage
The Change Capture Stage is a processing stage. The stage compares two data sets and makes a record of the differences. The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. The stage produces a change data set, whose table definition is transferred from the after data set's table definition with the addition of one column: a change code with values encoding the four actions: insert, delete, copy, and edit. The preserve-partitioning flag is set on the change data set. The compare is based on a set of key columns; rows from the two data sets are assumed to be copies of one another if they have the same values in these key columns. You can also optionally specify change values. If two rows have identical key columns, you can compare the value columns in the rows to see if one is an edited copy of the other. The stage assumes that the incoming data is key-partitioned and sorted in ascending order. The columns the data is hashed on should be the key columns used for the data compare. You can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting and partitioning abilities of the Change Capture stage. You can use the companion Change Apply stage to combine the changes from the Change Capture stage with the original before data set to reproduce the after data set (see Chapter 32).


The Change Capture stage is very similar to the Difference stage described in Chapter 33. The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify details about the before and after data sets being compared. Outputs Page. This is where you specify details about the processed data being output from the stage.

Example Data
This example shows a before and after data set, and the data set that is output by the Change Capture stage when it has compared them. This is the before data set:


This is the after data set:

This is the data set output by the Change Capture stage (bcol4 is the key column, bcol1 the value column):

The change_code indicates that, in these three rows, the bcol1 column in the after data set has been edited. The bcol1 column carries the edited value.

Must Dos
DataStage has many defaults which means that it can be very easy to include Change Capture stages in a job. This section specifies the minimum steps to take to get a Change Capture stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you get familiar with the product. To use a Change Capture stage: In the Stage Page Properties Tab:


Specify the key column. You can repeat this property to specify a composite key. Before and after rows are considered to be the same if they have the same value in the key column or columns. Optionally specify one or more Value columns. This enables you to determine if an after row is an edited version of a before row.

(You can also set the Change Mode property to have DataStage treat all columns not defined as keys as values, or all columns not defined as values as keys.)

Specify whether the stage will output the changed row or drop it. You can specify this individually for each type of change (copy, delete, edit, or insert).

In the Stage Page Link Ordering Tab, specify which of the two links carries the before data set and which carries the after data set. If the two incoming data sets aren't already key partitioned on the key columns and sorted, set DataStage to do this on the Inputs Page Partitioning Tab. In the Outputs Page Mapping Tab, specify how the change data columns are mapped onto the output link columns.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which input link carries the before data set and which the after data set. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Change Keys/Key: Values = Input Column; Default = N/A; Mandatory = Y; Repeats = Y; Dependent of = N/A
Change Keys/Case Sensitive: Values = True/False; Default = True; Mandatory = N; Repeats = N; Dependent of = Key
Change Keys/Sort Order: Values = Ascending/Descending; Default = Ascending; Mandatory = N; Repeats = N; Dependent of = Key
Change Keys/Nulls Position: Values = First/Last; Default = First; Mandatory = N; Repeats = N; Dependent of = Key
Change Values/Value: Values = Input Column; Default = N/A; Mandatory = N; Repeats = Y; Dependent of = N/A
Change Values/Case Sensitive: Values = True/False; Default = True; Mandatory = N; Repeats = N; Dependent of = Value
Options/Change Mode: Values = Explicit Keys & Values/All keys, Explicit values/Explicit Keys, All Values; Default = Explicit Keys & Values; Mandatory = Y; Repeats = N; Dependent of = N/A
Options/Log Statistics: Values = True/False; Default = False; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Drop Output for Insert: Values = True/False; Default = False; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Drop Output for Delete: Values = True/False; Default = False; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Drop Output for Edit: Values = True/False; Default = False; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Drop Output for Copy: Values = True/False; Default = True; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Code Column Name: Values = string; Default = change_code; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Copy Code: Values = number; Default = 0; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Deleted Code: Values = number; Default = 2; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Edit Code: Values = number; Default = 3; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Insert Code: Values = number; Default = 1; Mandatory = N; Repeats = N; Dependent of = N/A


Change Keys Category


Key Specifies the name of a difference key input column (see page 31-1 for an explanation of how Key columns are used). This property can be repeated to specify multiple difference key input columns. You can use the Column Selection dialog box to select several keys at once if required (see page 3-10). Key has the following dependent properties: Case Sensitive Use this property to specify whether each key is case sensitive or not. It is set to True by default; for example, the values CASE and case would not be judged equivalent. Sort Order Specify ascending or descending sort order. Nulls Position Specify whether null values should be placed first or last.

Change Value category


Value Specifies the name of a value input column (see page 31-1 for an explanation of how Value columns are used). You can use the Column Selection dialog box to select several values at once if required (see page 3-10). Value has the following dependent properties: Case Sensitive Use this property to specify whether each value is case sensitive or not. It is set to True by default; for example, the values CASE and case would not be judged equivalent.

Options Category
Change Mode This mode determines how keys and values are specified. Choose Explicit Keys & Values to specify the keys and values yourself. Choose All keys, Explicit values to specify that value columns must be defined, but all other columns are key columns unless excluded. Choose Explicit Keys, All Values to specify that key columns must be defined but all other columns are value columns unless they are excluded.


Log Statistics This property configures the stage to display result information containing the number of input rows and the number of copy, delete, edit, and insert rows. Drop Output for Insert Specifies to drop (not generate) an output row for an insert result. By default, an output row is always created by the stage. Drop Output for Delete Specifies to drop (not generate) the output row for a delete result. By default, an output row is always created by the stage. Drop Output for Edit Specifies to drop (not generate) the output row for an edit result. By default, an output row is always created by the stage. Drop Output for Copy Specifies to drop (not generate) the output row for a copy result. By default, an output row is not created by the stage. Code Column Name Allows you to specify a different name for the output column carrying the change code generated for each record by the stage. By default the column is called change_code. Copy Code Allows you to specify an alternative value for the code that indicates the after record is a copy of the before record. By default this code is 0. Deleted Code Allows you to specify an alternative value for the code that indicates that a record in the before set has been deleted from the after set. By default this code is 2. Edit Code Allows you to specify an alternative value for the code that indicates the after record is an edited version of the before record. By default this code is 3.


Insert Code Allows you to specify an alternative value for the code that indicates a new record has been inserted in the after set that did not exist in the before set. By default this code is 1.
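As an illustration (assuming the default code values and the default change_code column name), you could follow the Change Capture stage with a Filter stage to route only the edited rows by using the Where clause:
change_code = 3
or keep inserts and edits together with:
change_code = 1 or change_code = 3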

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pools or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering Tab


This tab allows you to specify which input link carries the before data set and which carries the after data set.

By default the first link added will represent the before set. To rearrange the links, choose an input link and click the up arrow button or the down arrow button.

NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Change Capture stage uses this when it is determining the sort order for key columns. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Change Capture stage expects two incoming data sets: a before data set and an after data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being compared. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Change Capture stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is compared. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. In the case of the Change Capture stage, DataStage will determine if the incoming data is key partitioned. If it is, the Same method is used; if not, DataStage will hash partition the data and sort it. You could also explicitly choose hash and take advantage of the on-stage sorting. If the Change Capture stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Change Capture stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Change Capture stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Change Capture stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Change Capture stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place.


DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button. Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button. The following Collection methods are available: (Auto). This is the default collection method for Change Capture stages. For the Change Capture stage, DataStage will ensure that the data is sorted as it is collected. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being compared. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Change Capture stage. The Change Capture stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Change Capture stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about Change Capture stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For the Change Capture stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them and which column carries the change code data.


The left pane shows the columns from the before/after data sets plus the change code column. These are read only and cannot be modified on this tab. The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. By default the data set columns are mapped automatically. You need to ensure that there is an output column to carry the change code and that this is mapped to the Change_code column.


32
Change Apply Stage
The Change Apply stage is a processing stage. It takes the change data set, which contains the changes in the before and after data sets, from the Change Capture stage and applies the encoded change operations to a before data set to compute an after data set. (See Chapter 31 for a description of the Change Capture stage.) The before input to Change Apply must have the same columns as the before input that was input to Change Capture, and an automatic conversion must exist between the types of corresponding columns. In addition, results are only guaranteed if the contents of the before input to Change Apply are identical (in value and record order in each partition) to the before input that was fed to Change Capture, and if the keys are unique.
Note The change input to Change Apply must have been output from Change Capture without modification. Because preserve-partitioning is set on the change output of Change Capture, you will be warned at run time if the Change Apply stage does not have the same number of partitions as the Change Capture stage. Additionally, both inputs of Change Apply are designated as partitioned using the Same partitioning method.


The Change Apply stage reads a record from the change data set and from the before data set, compares their key column values, and acts accordingly: If the before keys come before the change keys in the specified sort order, the before record is copied to the output. The change record is retained for the next comparison. If the before keys are equal to the change keys, the behavior depends on the code in the change_code column of the change record:

Insert: The change record is copied to the output; the stage retains the same before record for the next comparison. If key columns are not unique, and there is more than one consecutive insert with the same key, then Change Apply applies all the consecutive inserts before existing records. This record order may be different from the after data set given to Change Capture. Delete: The value columns of the before and change records are compared. If the value columns are the same or if the Check Value Columns on Delete is specified as False, the change and before records are both discarded; no record is transferred to the output. If the value columns are not the same, the before record is copied to the output and the stage retains the same change record for the next comparison. If key columns are not unique, the value columns ensure that the correct record is deleted. If more than one record with the same keys have matching value columns, the first-encountered record is deleted. This may cause different record ordering than in the after data set given to the Change Capture stage. A warning is issued and both change record and before record are discarded, i.e. no output record results.

Edit: The change record is copied to the output; the before record is discarded. If key columns are not unique, then the first before record encountered with matching keys will be edited. This may be a different record from the one that was edited in the after data set given to the Change Capture stage. A warning is issued and the change record is copied to the output; but the stage retains the same before record for the next comparison. Copy: The change record is discarded. The before record is copied to the output.

If the before keys come after the change keys, behavior also depends on the change_code column:


Insert. The change record is copied to the output, the stage retains the same before record for the next comparison. (The same as when the keys are equal.) Delete. A warning is issued and the change record discarded while the before record is retained for the next comparison. Edit or Copy. A warning is issued and the change record is copied to the output while the before record is retained for the next comparison.

Note If the before input of Change Apply is identical to the before input of Change Capture and either the keys are unique or copy records are used, then the output of Change Apply is identical to the after input of Change Capture. However, if the before input of Change Apply is not the same (different record contents or ordering), or the keys are not unique and copy records are not used, this is not detected and the rules described above are applied anyway, producing a result that might or might not be useful.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify details about the incoming before and change data sets. Outputs Page. This is where you specify details about the processed data being output from the stage.

Example Data
This example shows a before and change data set, and the data set that is output by the Change Apply stage when it has compared them.


This is the before data set:

This is the change data set, as output by a Change Capture stage:

This is the after data set, output by the Change Apply stage (bcol4 is the key column, bcol1 the value column):


Must Dos
DataStage has many defaults which means that it can be very easy to include Change Apply stages in a job. This section specifies the minimum steps to take to get a Change Apply stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you get familiar with the product. To use a Change Apply stage: In the Stage Page Properties Tab:

Specify the key column. You can repeat this property to specify a composite key. Before and change rows are considered to be the same if they have the same value in the key column or columns. Optionally specify one or more Value columns.

(You can also set the Change Mode property to have DataStage treat all columns not defined as keys as values, or all columns not defined as values as keys.) In the Stage Page Link Ordering Tab, specify which of the two links carries the before data set and which carries the change data set. In the Outputs Page Mapping Tab, specify how the change data columns are mapped onto the output link columns.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Change Keys/Key: Values = Input Column; Default = N/A; Mandatory = Y; Repeats = Y; Dependent of = N/A
Change Keys/Case Sensitive: Values = True/False; Default = True; Mandatory = N; Repeats = N; Dependent of = Key
Change Keys/Sort Order: Values = Ascending/Descending; Default = Ascending; Mandatory = N; Repeats = N; Dependent of = Key
Change Keys/Nulls Position: Values = First/Last; Default = First; Mandatory = N; Repeats = N; Dependent of = Key
Change Values/Value: Values = Input Column; Default = N/A; Mandatory = N; Repeats = Y; Dependent of = N/A
Change Values/Case Sensitive: Values = True/False; Default = True; Mandatory = N; Repeats = N; Dependent of = Value
Options/Change Mode: Values = Explicit Keys & Values/All keys, Explicit values/Explicit Keys, All Values; Default = Explicit Keys & Values; Mandatory = Y; Repeats = N; Dependent of = N/A
Options/Log Statistics: Values = True/False; Default = False; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Check Value Columns on Delete: Values = True/False; Default = True; Mandatory = Y; Repeats = N; Dependent of = N/A
Options/Code Column Name: Values = string; Default = change_code; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Copy Code: Values = number; Default = 0; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Deleted Code: Values = number; Default = 2; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Edit Code: Values = number; Default = 3; Mandatory = N; Repeats = N; Dependent of = N/A
Options/Insert Code: Values = number; Default = 1; Mandatory = N; Repeats = N; Dependent of = N/A

Change Keys Category


Key Specifies the name of a difference key input column. This property can be repeated to specify multiple difference key input columns. You can use the Column Selection dialog box to select several keys at once if required (see page 3-10). Key has the following dependent properties: Case Sensitive Use this property to specify whether each key is case sensitive or not. It is set to True by default; for example, the values CASE and case would not be judged equivalent. Sort Order Specify ascending or descending sort order. Nulls Position Specify whether null values should be placed first or last.

Change Value category


Value Specifies the name of a value input column (see page 32-2 for an explanation of how Value columns are used). You can use the Column Selection dialog box to select several values at once if required (see page 3-10). Value has the following dependent properties: Case Sensitive Use this property to specify whether each value is case sensitive or not. It is set to True by default; for example, the values CASE and case would not be judged equivalent.

Options Category
Change Mode This mode determines how keys and values are specified. Choose Explicit Keys & Values to specify the keys and values yourself. Choose All keys, Explicit values to specify that value columns must be defined, but all other columns are key columns unless excluded. Choose Explicit Keys, All Values to specify that key columns must be defined but all other columns are value columns unless they are excluded. Log Statistics This property configures the stage to display result information containing the number of input records and the number of copy, delete, edit, and insert records.


Check Value Columns on Delete Specifies whether DataStage should check value columns on deletes. Normally (the default, True), Change Apply compares the value columns of delete change records to those in the before record to ensure that it is deleting the correct record; set this to False to skip that check. Code Column Name Allows you to specify that a different name has been used for the change data set column carrying the change code generated for each record by the stage. By default the column is called change_code. Copy Code Allows you to specify an alternative value for the code that indicates a record copy. By default this code is 0. Deleted Code Allows you to specify an alternative value for the code that indicates a record delete. By default this code is 2. Edit Code Allows you to specify an alternative value for the code that indicates a record edit. By default this code is 3. Insert Code Allows you to specify an alternative value for the code that indicates a record insert. By default this code is 1.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.


Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Link Ordering Tab


This tab allows you to specify which input link carries the before data set and which carries the change data set.

By default the first link added will represent the before set. To rearrange the links, choose an input link and click the up arrow button or the down arrow button.


NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Change Apply stage uses this when it is determining the sort order for key columns. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the incoming data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being compared. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Change Apply stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Partitioning Tab
The change input to Change Apply should have been output from the Change Capture stage without modification and should have the same number of partitions. Additionally, both inputs of Change Apply are automatically designated as partitioned using the Same partitioning method. The standard partitioning and collecting controls are available on the Change Apply stage, however, so you can override this behavior. If the Change Apply stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method. The Partitioning tab allows you to override the default behavior. The exact operation of this tab depends on: Whether the Change Apply stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Change Apply stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type dropdown list. This will override any current partitioning. If the Change Apply stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Change Apply stage, and will apply the Same method. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage.


Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for the Change Apply stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the operation is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.


You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
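
As a rough illustration of the Hash and Modulus partitioning methods listed earlier in this section, the following minimal C sketch shows how a record might be assigned to one of N partitions from its key. The hash function shown is an arbitrary stand-in, not the one the engine actually uses.

#include <stdio.h>
#include <stdlib.h>

/* Arbitrary string hash (djb2), standing in for the engine's real hash function. */
static int hash_partition(const char *key, int num_partitions)
{
    unsigned long h = 5381;
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return (int)(h % (unsigned long)num_partitions);
}

/* Modulus partitioning on an integer key column, e.g. a tag field. */
static int modulus_partition(long key_value, int num_partitions)
{
    return (int)(labs(key_value) % num_partitions);
}

int main(void)
{
    printf("hash    -> partition %d\n", hash_partition("CASE", 4));
    printf("modulus -> partition %d\n", modulus_partition(42, 4));
    return 0;
}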

Outputs Page
The Outputs page allows you to specify details about data output from the Change Apply stage. The Change Apply stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Change Apply stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about Change Apply stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For the Change Apply stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.


The left pane shows the common columns of the before and change data sets. These are read only and cannot be modified on this tab. The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Automatch facility. By default the columns are mapped straight across as shown in the example.


33
Difference Stage
The Difference stage is a processing stage. It performs a record-byrecord comparison of two input data sets, which are different versions of the same data set designated the before and after data sets. The Difference stage outputs a single data set whose records represent the difference between them. The stage assumes that the input data sets have been key-partitioned and sorted in ascending order on the key columns you specify for the Difference stage comparison. You can achieve this by using the Sort stage or by using the built in sorting and partitioning abilities of the Difference stage. The comparison is performed based on a set of difference key columns. Two records are copies of one another if they have the same value for all difference keys. You can also optionally specify change values. If two records have identical key columns, you can compare the value columns to see if one is an edited copy of the other. The Difference stage is similar, but not identical, to the Change Capture stage described in Chapter 31. The Change Capture stage is intended to be used in conjunction with the Change Apply stage (Chapter 32); it produces a change data set which contains changes that need to be applied to the before data set to turn it into the after data set. The Difference stage outputs the before and after rows to the output data set, plus a code indicating if there are differences. Usually, the before and after data will have the same column names, in which case the after data set effectively overwrites the before data set and so you only see one set of columns in the output. You are warned that DataStage is doing this. If your before and after data sets have different column names, columns from both data sets are output; note that any key and value columns must have the same name.
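
As a rough illustration (not the stage's actual implementation), the following C sketch classifies each pair of key-matched, pre-sorted rows in the way described above: matching keys and values give a copy, matching keys with differing values give an edit, a key present only in the before input gives a delete, and a key present only in the after input gives an insert. The row layout is hypothetical.

#include <stdio.h>
#include <string.h>

/* Hypothetical row: one difference key column and one value column. */
typedef struct { int key; char value[8]; } Row;

/* Merge two inputs already sorted on the key and report each outcome. */
static void diff(const Row *before, int nb, const Row *after, int na)
{
    int b = 0, a = 0;
    while (b < nb || a < na) {
        if (a >= na || (b < nb && before[b].key < after[a].key)) {
            printf("key %d: delete\n", before[b].key); b++;
        } else if (b >= nb || after[a].key < before[b].key) {
            printf("key %d: insert\n", after[a].key); a++;
        } else {            /* keys match: compare the value columns */
            printf("key %d: %s\n", before[b].key,
                   strcmp(before[b].value, after[a].value) == 0 ? "copy" : "edit");
            b++; a++;
        }
    }
}

int main(void)
{
    Row before[] = { {1, "a"}, {2, "b"}, {3, "c"} };
    Row after[]  = { {1, "a"}, {2, "B"}, {4, "d"} };
    diff(before, 3, after, 3);   /* copy, edit, delete, insert */
    return 0;
}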


The stage generates an extra column, Diff, which indicates the result of each record comparison.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify details about the data set having its duplicates removed. Outputs Page. This is where you specify details about the processed data being output from the stage.

Example Data
This example shows a before and after data set, and the data set that is output by the Difference stage when it has compared them. This is the before data set:


This is the after data set:

This is the data set output by the Difference stage (Key is the key column, All non-key columns are values is set True, all other settings take the default):

The diff column indicates that rows b, e, and f have been edited in the after data set (the rows output carry the data after editing).

Must Dos
DataStage has many defaults, which means that it can be very easy to include Difference stages in a job. This section specifies the minimum steps to take to get a Difference stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Difference stage: In the Stage Page Properties Tab:


Specify the key column. You can repeat this property to specify a composite key. Before and after rows are considered to be the same if they have the same value in the key column or columns. Optionally specify one or more Difference Value columns. This enables you to determine if an after row is an edited version of a before row.

(You can also set the All non-Key columns are Values property to have DataStage treat all columns not defined as keys treated as values.)

Specify whether the stage will output the changed row or drop it. You can specify this individually for each type of change (copy, delete, edit, or insert).

In the Stage Page Link Ordering Tab, specify which of the two links carries the before data set and which carries the after data set. If the two incoming data sets aren't already hash partitioned on the key columns and sorted, set DataStage to do this on the Inputs Page Partitioning Tab. In the Outputs Page Mapping Tab, specify how the difference columns are mapped onto the output link columns.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which input link carries the before data set and which the after data set. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property                                  Values        Default  Mandatory?  Repeats?  Dependent of
Difference Keys/Key                                Input Column  N/A      Y           Y         N/A
Difference Keys/Case Sensitive                     True/False    True     N           N         Key
Difference Values/All non-Key Columns are Values   True/False    False    Y           N         N/A
Difference Values/Case Sensitive                   True/False    True     N           N         All non-Key Columns are Values (when true)
Options/Tolerate Unsorted Inputs                   True/False    False    N           N         N/A
Options/Log Statistics                             True/False    False    N           N         N/A
Options/Drop Output for Insert                     True/False    False    N           N         N/A
Options/Drop Output for Delete                     True/False    False    N           N         N/A
Options/Drop Output for Edit                       True/False    False    N           N         N/A
Options/Drop Output for Copy                       True/False    False    N           N         N/A
Options/Copy Code                                  number        2        N           N         N/A
Options/Deleted Code                               number        1        N           N         N/A
Options/Edit Code                                  number        3        N           N         N/A
Options/Insert Code                                number        0        N           N         N/A

Difference Keys Category


Key Specifies the name of a difference key input column. This property can be repeated to specify multiple difference key input columns. You can use the Column Selection dialog box to select several keys at once if required (see page 3-10). Key has this dependent property:


Case Sensitive Use this property to specify whether each key is case sensitive or not. It is set to True by default; for example, the values CASE and case would not be judged equivalent.
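
In effect, a case-sensitive key comparison behaves like strcmp and a case-insensitive one like the POSIX strcasecmp, as in this small illustrative sketch (the helper name is hypothetical):

#include <stdio.h>
#include <string.h>
#include <strings.h>

/* With Case Sensitive = True, "CASE" and "case" are different key values;
   with Case Sensitive = False they are treated as the same key. */
static int keys_equal(const char *a, const char *b, int case_sensitive)
{
    return case_sensitive ? strcmp(a, b) == 0 : strcasecmp(a, b) == 0;
}

int main(void)
{
    printf("%d %d\n", keys_equal("CASE", "case", 1), keys_equal("CASE", "case", 0));  /* 0 1 */
    return 0;
}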

Difference Values Category


All non-Key Columns are Values Set this to True to indicate that any columns not designated as difference key columns are value columns (see page 33-1 for a description of value columns). It is False by default. The property has this dependent property: Case Sensitive Use this property to specify whether each value is case sensitive or not. It is set to True by default; for example, the values CASE and case would not be judged equivalent. This property is only available if the All non-Key columns are values property is set to True.

Options Category
Tolerate Unsorted Inputs Specifies that the input data sets are not sorted. This property allows you to process groups of records that may be arranged by the difference key columns but not sorted. The stage processes the input records in the order in which they appear on its input. It is False by default. Log Statistics This property configures the stage to display result information containing the number of input records and the number of copy, delete, edit, and insert records. It is False by default. Drop Output for Insert Specifies to drop (not generate) an output record for an insert result. By default, an output record is always created by the stage. Drop Output for Delete Specifies to drop (not generate) the output record for a delete result. By default, an output record is always created by the stage.


Drop Output for Edit Specifies to drop (not generate) the output record for an edit result. By default, an output record is always created by the stage. Drop Output for Copy Specifies to drop (not generate) the output record for a copy result. By default, an output record is always created by the stage. Copy Code Allows you to specify an alternative value for the code that indicates the after record is a copy of the before record. By default this code is 2. Deleted Code Allows you to specify an alternative value for the code that indicates that a record in the before set has been deleted from the after set. By default this code is 1. Edit Code Allows you to specify an alternative value for the code that indicates the after record is an edited version of the before record. By default this code is 3. Insert Code Allows you to specify an alternative value for the code that indicates a new record has been inserted in the after set that did not exist in the before set. By default this code is 0.
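
A minimal sketch, using hypothetical helper names, of how the Drop Output options and the configurable codes above could combine for each comparison outcome; the defaults shown match the descriptions in the text (insert 0, delete 1, copy 2, edit 3).

#include <stdio.h>

typedef enum { OUT_INSERT, OUT_DELETE, OUT_EDIT, OUT_COPY } Outcome;

typedef struct {
    int drop[4];    /* Drop Output for Insert / Delete / Edit / Copy     */
    int code[4];    /* Insert Code, Deleted Code, Edit Code, Copy Code   */
} DiffOptions;

/* Stage defaults described above: nothing dropped, codes 0, 1, 3, 2. */
static const DiffOptions defaults = { {0, 0, 0, 0}, {0, 1, 3, 2} };

/* Returns the value for the Diff column, or -1 if the record is dropped. */
static int diff_code(Outcome o, const DiffOptions *opt)
{
    return opt->drop[o] ? -1 : opt->code[o];
}

int main(void)
{
    printf("edit -> %d\n", diff_code(OUT_EDIT, &defaults));   /* prints: edit -> 3 */
    return 0;
}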

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.


Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Link Ordering Tab


This tab allows you to specify which input link carries the before data set and which carries the after data set.

By default the first link added will represent the before set. To rearrange the links, choose an input link and click the up arrow button or the down arrow button.


NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Difference stage uses this when it is determining the sort order for key columns. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Difference stage expects two incoming data sets: a before data set and an after data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being compared. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Difference stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the operation is performed. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. For a Difference stage, DataStage checks to see if the incoming data is key-partitioned and sorted. If it is, the Same method is used, if not, DataStage will key partition the data and sort it. You could also explicitly choose hash or modulus partitioning methods and take advantage of the on-stage sorting. If the Difference stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Difference stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Difference stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Difference stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Difference stage. If the incoming data is already key-partitioned and sorted, DataStage will use the Same method. Otherwise it will key partition and sort for you. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.


Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button The following Collection methods are available: (Auto). This is the default collection method for Difference stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. For the Difference stage, DataStage will ensure that the data is sorted as it is collected. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the operation is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.


Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Difference stage. The Difference stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Difference stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about Difference stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Mapping Tab
For the Difference stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the columns from the before/after data sets plus the DiffCode column. These are read only and cannot be modified on this tab. The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. By default the data set columns are mapped automatically. You need to ensure that there is an output column to carry the change code and that this is mapped to the DiffCode column.


34
Compare Stage
The Compare stage is a processing stage. It can have two input links and a single output link. The Compare stage performs a column-by-column comparison of records in two presorted input data sets. You can restrict the comparison to specified key columns. The Compare stage does not change the table definition, partitioning, or content of the records in either input data set. It transfers both data sets intact to a single output data set generated by the stage. The comparison results are also recorded in the output data set. We recommend that you use runtime column propagation in this stage and allow DataStage to define the output column schema for you. The stage outputs a data set with three columns: result. Carries the code giving the result of the comparison. first. A subrecord containing the columns of the first input link. second. A subrecord containing the columns of the second input link. If you specify the output link meta data yourself, you should use fully qualified names for the column definitions (e.g. first.col1, second.col1 etc.), because DataStage will not let you specify two lots of identical column names.
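
The shape of each output record described above can be pictured as a small C struct. This is an illustrative sketch only; the subrecord fields depend entirely on the columns of the input links and are normally derived for you by runtime column propagation.

/* Illustrative layout of one Compare stage output record. */
typedef struct { int col1; /* ... remaining input columns ... */ } InputRow;

typedef struct {
    int      result;   /* code giving the result of the comparison     */
    InputRow first;    /* subrecord: columns of the first input link   */
    InputRow second;   /* subrecord: columns of the second input link  */
} CompareOutput;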


The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Example Data
This example shows two data sets being compared, and the data set that is output by the Compare stage when it has compared them. This is the first data set:


This is the second data set:

The stage compares on the Key columns bcol1 and bcol4. This is the output data set:
Result   First                                   Second
         bcol0  bcol1  bcol2  bcol3  bcol4       bcol0  bcol1  bcol2  bcol3  col4
 0       0      0      0      0      a           0      0      0      0      a
 2       1      7      1      1      b           1      1      1      1      b
 0       2      2      2      2      c           2      2      2      2      c
 0       3      3      3      3      d           3      3      3      3      d
 2       4      5      4      4      e           4      4      4      4      e
-1       5      2      5      5      f           5      5      5      5      f
 0       6      6      6      6      g           6      6      6      6      g
 0       7      7      7      7      h           7      7      7      7      h
 0       8      8      8      8      i           8      8      8      8      i
 0       9      9      9      9      j           9      9      9      9      j

Must Dos
DataStage has many defaults, which means that it can be very easy to include Compare stages in a job. This section specifies the minimum steps to take to get a Compare stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Compare stage: In the Stage Page Properties Tab, check that the default settings are suitable for your requirements. In the Stage Page Link Ordering Tab, specify which of your input links is the first link and which is the second.


Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property                        Values        Default  Mandatory?  Repeats?  Dependent of
Options/Abort On Difference              True/False    False    Y           N         N/A
Options/Warn on Record Count Mismatch    True/False    False    Y           N         N/A
Options/Equals Value                     number        0        N           N         N/A
Options/First is Empty Value             number        1        N           N         N/A
Options/Greater Than Value               number        2        N           N         N/A
Options/Less Than Value                  number        -1       N           N         N/A
Options/Second is Empty Value            number        -2       N           N         N/A
Options/Key                              Input Column  N/A      N           Y         N/A
Options/Case Sensitive                   True/False    True     N           N         Key


Options Category
Abort On Difference This property forces the stage to abort its operation each time a difference is encountered between two corresponding columns in any record of the two input data sets. This is False by default, if you set it to True you cannot set Warn on Record Count Mismatch. Warn on Record Count Mismatch This property directs the stage to output a warning message when a comparison is aborted due to a mismatch in the number of records in the two input data sets. This is False by default, if you set it to True you cannot set Abort on difference. Equals Value Allows you to set an alternative value for the code which the stage outputs to indicate two compared records are equal. This is 0 by default. First is Empty Value Allows you to set an alternative value for the code which the stage outputs to indicate the first record is empty. This is 1 by default. Greater Than Value Allows you to set an alternative value for the code which the stage outputs to indicate the first record is greater than the other. This is 2 by default. Less Than Value Allows you to set an alternative value for the code which the stage outputs to indicate the second record is greater than the other. This is -1 by default. Second is Empty Value Allows you to set an alternative value for the code which the stage outputs to indicate the second record is empty. This is -2 by default. Key Allows you to specify one or more key columns. Only these columns will be compared. Repeat the property to specify multiple columns. You can use the Column Selection dialog box to select several keys


at once if required (see page 3-10). The Key property has a dependent property: Case Sensitive Use this to specify whether each key is case sensitive or not. This is set to True by default, i.e., the values CASE and case would be treated as different values.
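
A minimal sketch, assuming a single integer key column, of how the default codes above could map onto a per-record result; the real stage also handles multiple key columns and the First/Second is Empty cases, and the helper name is hypothetical.

#include <stdio.h>

/* Default codes described above: Equals 0, Greater Than 2, Less Than -1. */
static int compare_result(int first_key, int second_key)
{
    if (first_key == second_key) return 0;
    return (first_key > second_key) ? 2 : -1;
}

int main(void)
{
    printf("%d %d %d\n", compare_result(7, 1), compare_result(2, 2), compare_result(2, 5));  /* 2 0 -1 */
    return 0;
}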

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering Tab


This tab allows you to specify which input link carries the First data set and which carries the Second data set. Which is categorized as first and which second affects the setting of the comparison code.

By default the first link added will represent the First set. To rearrange the links, choose an input link and click the up arrow button or the down arrow button.

NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Compare stage uses this when it is determining the sort order for


key columns. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Compare stage expects two incoming data sets. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being compared. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Compare stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
If you are running the Compare stage in parallel you must ensure that the incoming data is suitably partitioned and sorted to make a comparison sensible. The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is compared. It also allows you to specify that the data should be sorted before being operated on.


By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Compare stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Compare stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Compare stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Compare stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Compare stage, and will ensure that incoming data is key partitioned and sorted. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button .


Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Compare stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. For the Compare stage, DataStage will ensure that the data is sorted as it is collected. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. If you are collecting data, the Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being collected and compared. The sort is always carried out within data partitions. The sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.


Outputs Page
The Outputs page allows you to specify details about data output from the Compare stage. The Compare stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of the tabs.


35
Encode Stage
The Encode stage is a processing stage. It encodes a data set using a UNIX encoding command, such as gzip, that you supply. The stage converts a data set from a sequence of records into a stream of raw binary data. The companion Decode stage reconverts the data stream to a data set (see Chapter 36). An encoded data set is similar to an ordinary one, and can be written to a data set stage. You cannot use an encoded data set as an input to stages that performs column-based processing or re-orders rows, but you can input it to stages such as Copy. You can view information about the data set in the data set viewer, but not the data itself. You cannot repartition an encoded data set, and you will be warned at runtime if your job attempts to do that. As the output is always a single stream, you do not have to define meta data for the output link.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records.


Outputs Page. This is where you specify details about the processed data being output from the stage.

Must Dos
DataStage has many defaults, which means that it can be very easy to include Encode stages in a job. This section specifies the minimum steps to take to get an Encode stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use an Encode stage: In the Stage Page Properties Tab, specify the UNIX command that will be used to encode the data, together with any required arguments. The command should expect its input from STDIN and send its output to STDOUT.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. This stage only has one property and you must supply a value for this. The property appears in the warning color (red by default) until you supply a value.
Category/Property      Values        Default  Mandatory?  Repeats?  Dependent of
Options/Command Line   Command Line  N/A      Y           N         N/A

Options Category
Command Line Specifies the command line used for encoding the data set. The command line must configure the UNIX command to accept input


from standard input and write its results to standard output. The command must be located in your search path and be accessible by every processing node on which the Encode stage executes.
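
For example, assuming the gzip utility is installed and on the search path of every processing node, entering the single word below is sufficient, because gzip compresses standard input to standard output when it is given no file arguments:

    Command Line = gzip

Any required switches (such as a compression level) can be added as arguments in the same property.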

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Set by default to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Encode stage can only have one input link. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being encoded. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link.


Details about Encode stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is encoded. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Encode stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Encode stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Encode stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Encode stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Encode stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.


Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Encode stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being encoded. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.


If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Encode stage. The Encode stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab allows you to specify column definitions for the data (although this is optional for an encode stage). The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of these tabs.


36
Decode Stage
The Decode stage is a processing stage. It decodes a data set using a UNIX decoding command, such as gzip, that you supply. It converts a data stream of raw binary data into a data set. Its companion stage, Encode, converts a data set from a sequence of records to a stream of raw binary data (see Chapter 35). As the input is always a single stream, you do not have to define meta data for the input link.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Must Dos
DataStage has many defaults which means that it can be very easy to include Decode stages in a job. This section specifies the minimum


steps to take to get a Decode stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Decode stage: In the Stage Page Properties Tab, specify the UNIX command that will be used to decode the data, together with any required arguments. The command should expect its input from STDIN and send its output to STDOUT.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. This stage only has one property and you must supply a value for this. The property appears in the warning color (red by default) until you supply a value.
Category/Property      Values        Default  Mandatory?  Repeats?  Dependent of
Options/Command Line   Command Line  N/A      Y           N         N/A

Options Category
Command Line Specifies the command line used for decoding the data set. The command line must configure the UNIX command to accept input from standard input and write its results to standard output. The command must be located in the search path of your application and be accessible by every processing node on which the Decode stage executes.
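
For example, to reverse an Encode stage that used gzip, and assuming gzip is on the search path of every processing node, you might enter:

    Command Line = gzip -d

With no file arguments, gzip -d decompresses standard input to standard output, which is what the stage requires.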

Advanced Tab
This tab allows you to specify the following:


Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Decode stage expects a single incoming data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being decoded. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Decode stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is decoded. It also allows you to specify that the data should be sorted before being operated on. The Decode stage partitions in Same mode and this cannot be overridden. If the Decode stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following Collection methods are available:
(Auto). This is the default collection method for Decode stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

Outputs Page
The Outputs page allows you to specify details about data output from the Decode stage. The Decode stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions for the decoded data. See Chapter 3, "Stage Editors," for a general description of the tabs.


37
Switch Stage
The Switch stage is a processing stage. It can have a single input link, up to 128 output links and a single rejects link. The Switch stage takes a single data set as input and assigns each input row to an output data set based on the value of a selector field. The Switch stage performs an operation analogous to a C switch statement, which causes the flow of control in a C program to branch to one of several cases based on the value of a selector variable. Rows that satisfy none of the cases are output on the rejects link.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting rows. Outputs Page. This is where you specify details about the processed data being output from the stage.

Example
The example Switch stage (as shown on the previous page) implements the following switch statement:
switch (selector)
{
    case 0:      // if selector = 0,
                 // write record to output data set 0
        break;
    case 10:     // if selector = 10,
                 // write record to output data set 1
        break;
    case 12:     // if selector = discard value (12)
                 // skip record
        break;
    default:     // if selector is invalid,
                 // send row down reject link
};

The meta data input to the switch stage is as follows:

The column called Select is the selector; the value of this determines which output links the rest of the row will be output to. The properties of the stage are:

Must Dos
DataStage has many defaults, which means that it can be very easy to include Switch stages in a job. This section specifies the minimum steps to take to get a Switch stage functioning. DataStage provides a versatile user interface with many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Switch stage: In the Stage Page Properties Tab, under the Input category choose the Selector mode:

User-defined Mapping. This is the default, and means that you must provide explicit mappings from case values to outputs. If you use this mode you specify the switch expression under the User-defined Mapping category. Auto. This can be used where there are as many distinct selector values as there are output links. Hash. The incoming rows are hashed on the selector column modulo the number of output links and assigned to an output link accordingly.

In all cases you need to use the Selector property to specify the input column that the switch is performed on. You can also specify whether the column is case sensitive or not. The other properties depend on which mode you have chosen:

If you have chosen the User-defined mapping mode, under the User-defined Mapping category specify the case expression in the case property. Under the Option category, select the If not found property to specify what action the stage takes if the column value does not correspond to any of the cases. Choose from Fail to have the job fail, Drop to drop the row, or Output to output it on the reject link.

If you have chosen the Auto mode, under the Option category, select the If not found property to specify what action the stage takes if the column value does not correspond to any of the cases. Choose from Fail to have the job fail, Drop to drop the row, or Output to output it on the reject link. If you have chosen the Hash mode there are no other properties to fill in.

In the Stage Page Link Ordering Tab, specify the order in which the output links are processed. In the Output Page Mapping Tab check that the input columns map onto the output columns as you expect. The mapping is carried out according to what you specified in the Properties Tab.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify what order the output links are processed in. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property            Values                            Default                Mandatory?                                    Repeats?   Dependent of
Input/Selector               Input Column                      N/A                    Y                                             N          N/A
Input/Case Sensitive         True/False                        True                   N                                             N          Selector
Input/Selector Mode          User-defined mapping/Auto/Hash    User-defined mapping   Y                                             N          N/A
User-defined Mapping/Case    String                            N/A                    Y (if Selector Mode = User-defined mapping)   Y          N/A
Options/If not found         Fail/Drop/Output                  N/A                    N                                             N          N/A
Options/Discard Value        String                            N/A                    N                                             N          N/A

Input Category
Selector
Specifies the input column that the switch applies to.
Case Sensitive
Specifies whether the column is case sensitive or not.
Selector Mode
Specifies how you are going to define the case statements for the switch. Choose between:
User-defined Mapping. This is the default, and means that you must provide explicit mappings from case values to outputs. If you use this mode you specify the switch expression under the User-defined Mapping category.
Auto. This can be used where there are as many distinct selector values as there are output links.
Hash. The incoming rows are hashed on the selector column modulo the number of output links and assigned to an output link accordingly.

User-defined Mapping Category


Case
This property appears if you have chosen a Selector Mode of User-defined Mapping. Specify the case expression in the Case property. It has the following format:
Selector_Value[= Output_Link_Label_Number]

You must specify a selector value for each value of the input column that you want to direct to an output column. Repeat the Case property to specify multiple values. You can omit the output link label if the value is intended for the same output link as the case previously specified. For example, the case statements:
1990=0
1991
1992
1993=1
1994=1

would cause the rows containing the dates 1990, 1991, or 1992 in the selector column to be routed to output link 0, and the rows containing the dates 1993 to 1994 to be routed to output link 1.

Options Category
If not found
Specifies the action to take if a row fails to match any of the case statements. This does not appear if you choose a Selector Mode of Hash. Otherwise, choose between:
Fail. Causes the job to fail.
Drop. Drops the row.
Output. Routes the row to the Reject link.
Discard Value
You can use this property in conjunction with the Case property to specify that rows containing certain values in the selector column will always be discarded. For example, if you defined the following case statement:
1995=5

and set the Discard Value property to 5, all rows containing 1995 in the selector column would be routed to link 5 which has been specified as the discard link and so will be dropped.

Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering Tab


This tab allows you to specify which output links are associated with which link labels.

NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Switch stage uses this when evaluating case statements. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Switch stage expects one incoming data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being switched. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Switch stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is switched. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Switch stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
Whether the Switch stage is set to execute in parallel or sequential mode.
Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Switch stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Switch stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Switch stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.


Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
(Auto). This is the default collection method for Switch stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being switched. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows:
Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.


Outputs Page
The Outputs page allows you to specify details about data output from the Switch stage. The Switch stage can have up to 128 output links, and can also have a reject link carrying rows that have been rejected. Choose the link you are working on from the Output name drop-down list. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Switch stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output links. Details about Switch stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For the Switch stage the Mapping tab allows you to specify how the output columns are derived.

The left pane shows the columns that have been switched. These are read only and cannot be modified on this tab. The right pane shows the output columns for each link. In the example the stage has mapped the specified switched columns onto the output columns.


Reject Link
You cannot change the details of a Reject link. The link uses the column definitions for the link rejecting the data rows.


38
SAS Stage
The SAS stage is a processing stage. It can have multiple input links and multiple output links. The SAS stage allows you to execute part or all of an SAS application in parallel. It reduces or eliminates the performance bottlenecks that might otherwise occur when SAS is run on a parallel computer. (More information about using Enterprise Edition with SAS is given in SAS Stage Supplementary Guide.) Before using the SAS stage, you need to set up your configuration file to allow the system to interact with SAS; see "The SAS Resources" on page 58-28. DataStage enables SAS users to:
Access, for reading or writing, large volumes of data in parallel from parallel relational databases, with much higher throughput than is possible using PROC SQL.
Process parallel streams of data with parallel instances of SAS DATA and PROC steps, enabling scoring or other data transformations to be done in parallel with minimal changes to existing SAS code.
Store large data sets in parallel, eliminating restrictions on dataset size imposed by your file system or physical disk-size limitations. Parallel data sets are accessed from SAS programs in the same way as conventional SAS data sets, but at much higher data I/O rates.
Realize the benefits of pipeline parallelism, in which some number of SAS stages run at the same time, each receiving data from the previous process as it becomes available.
See also the SAS Data Set stage, described in Chapter 11.


The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Example Job
This example job shows SAS stages reading in data, operating on it, then writing it out. The example data is from a freight carrier who charges customers based on distance, equipment, packing, and license requirements. They need a report of distance traveled and charges for the month of July grouped by License type. The following table shows a sample of the data:
Ship Date      District   Distance   Equipment   Packing   License   Charge
...
Jun 2 2000     1          1540       D           M         BUN       1300
Jul 12 2000    1          1320       D           C         SUM       4800
Aug 2 2000     1          1760       D           C         CUM       1300
Jun 22 2000    2          1540       D           C         CUN       13500
Jul 30 2000    2          1320       D           M         SUM       6000
...

The job to handle the data looks like this:

The stage called SAS_0 reads the rows from the freight database where the first three characters of the Ship_date column = 'Jul'. You reference the data being input to this stage from inside the SAS code using the liborch library. Liborch is the SAS engine provided by DataStage that you use to reference the data input to and output from your SAS code. The following screenshot shows the SAS code for this stage:
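The screenshot itself is not included here; a hypothetical sketch of what a filter step of this kind could look like, assuming illustrative SAS data set names p_in and p_out (these names, and the exact code, are not the actual code from the example job), is:
data liborch.p_out;
   set liborch.p_in;
   if substr(Ship_date, 1, 3) = 'Jul';   /* keep only July rows */
run;
The subsetting IF keeps only the rows whose Ship_date value starts with Jul, which is the filter described above.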


The stage called SAS_1 sorts the data that has been extracted by SAS_0. The code for this is as follows:
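The code screenshot is not included here; a minimal sketch of a sort step of this kind, assuming illustrative data set names p_in and p_out and a sort on the License column (again, not the actual code from the example job), would be:
proc sort data = liborch.p_in out = liborch.p_out;
   by License;   /* order rows by license type */
run;
Sorting by License prepares the data for the BY-group processing performed in the next stage.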

Finally, the stage called SAS_2 outputs the mean and sum of the distances traveled and charges for the month of July sorted by license type.
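A hypothetical sketch of the kind of step SAS_2 might run (the data set name p_in is illustrative, and this is not the actual code from the example job) is:
proc means data = liborch.p_in n mean sum;
   by License;                 /* one group of statistics per license type */
   var Distance Charge;        /* report on distance travelled and charge */
run;
This produces, for each license type, the count, mean, and sum of the Distance and Charge columns, which corresponds to the output listing shown below.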


The following shows the SAS output for license type SUM:
17:39, May 26, 2003
...
LICENSE=SUM

Variable    Label       N          Mean           Sum
-------------------------------------------------------------------
DISTANCE    DISTANCE    720     1563.93     1126030.00
CHARGE      CHARGE      720    28371.39    20427400.00
-------------------------------------------------------------------
...
Step execution finished with status = OK.

Must Dos
DataStage has many defaults, which means that it can be very easy to include SAS stages in a job. This section specifies the minimum steps to take to get a SAS stage functioning. DataStage provides a versatile user interface with many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use an SAS stage: In the Stage Page Properties Tab, under the SAS Source category:

Specify the SAS code the stage will execute. You can also choose a Source Method of Source File and specify the name of a file containing the code. You need to use the liborch library in order to connect the SAS code to the data input to and/or output from the stage (see "Example Job" on page 38-2 for guidance on how to do this).

Under the Inputs and Outputs categories:

Specify the numbers of the input and output links the stage connects to, and the name of the SAS data set associated with those links.

In the Stage Page Link Ordering Tab, specify which input and output link numbers correspond to which actual input or output link. In the Output Page Mapping Tab, specify how the data being operated on maps onto the output links.

Using the SAS Stage on NLS Systems


If your system is NLS enabled, and you are using English or European languages, then you should set the environment variable APT_SASINT_COMMAND to point to the basic SAS executable (rather than the international one). For example:
APT_SASINT_COMMAND /usr/local/sas/sas8.2/sas

Alternatively, you can include a resource sasint entry in your configuration file. For example:
resource sasint "[/usr/sas82/]" { }

(See Chapter 58 for details about configuration files and the SAS resources.) When using NLS with any map, you need to edit the file sascs.txt to identify the maps that you are likely to use with the SAS stage. The file is platform-specific, and is located in $APT_ORCHHOME/apt/etc/platform, where platform is one of sun, aix, osf1 (Tru64), hpux, and linux. The file comprises two columns: the left-hand column gives an identifier, typically the name of the language. The right-hand column gives the name of the map. For example, if you were in Canada, your sascs.txt file might be as follows:
CANADIAN_FRENCH    fr_CA-iso-8859
ENGLISH            ISO-8859-5

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property                            Values                      Default   Mandatory?                            Repeats?   Dependent of
SAS Source/Source Method                     Explicit/Source File        Explicit  Y                                     N          N/A
SAS Source/Source                            code                        N/A       Y (if Source Method = Explicit)       N          N/A
SAS Source/Source File                       pathname                    N/A       Y (if Source Method = Source File)    N          N/A
Inputs/Input Link Number                     number                      N/A       N                                     Y          N/A
Inputs/Input SAS Data Set Name               string                      N/A       Y (if input link number specified)    N          Input Link Number
Outputs/Output Link Number                   number                      N/A       N                                     Y          N/A
Outputs/Output SAS Data Set Name             string                      N/A       Y (if output link number specified)   N          Output Link Number
Outputs/Set Schema from Columns              True/False                  False     Y (if output link number specified)   N          Output Link Number
Options/Disable Working Directory Warning    True/False                  False     Y                                     N          N/A
Options/Convert Local                        True/False                  False     Y                                     N          N/A
Options/Debug Program                        No/Verbose/Yes              No        Y                                     N          N/A
Options/SAS List File Location Type          File/Job Log/None/Output    Job Log   Y                                     N          N/A
Options/SAS Log File Location Type           File/Job Log/None/Output    Job Log   Y                                     N          N/A
Options/SAS Options                          string                      N/A       N                                     N          N/A
Options/Working Directory                    pathname                    N/A       N                                     N          N/A


SAS Source Category


Source Method
Choose from Explicit (the default) or Source File. You then have to set either the Source property or the Source File property to specify the actual source.
Source
Specify the SAS code to be executed. This can contain both PROC and DATA steps.
Source File
Specify a file containing the SAS code to be executed by the stage.

Inputs Category
Input Link Number
Specifies inputs to the SAS code in terms of input link numbers. Repeat the property to specify multiple links. This has a dependent property:
Input SAS Data Set Name. The name of the SAS data set receiving its input from the specified input link.
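For example, if you set Input Link Number to 1 and Input SAS Data Set Name to p_in (an illustrative name), the SAS code in the Source property would typically read that link's data as the data set liborch.p_in. The data sets named under the Outputs category are referenced through liborch in the same way.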

Outputs Category
Output Link Number
Specifies an output link to connect to the output of the SAS code. Repeat the property to specify multiple links. This has a dependent property:
Output SAS Data Set Name. The name of the SAS data set sending its output to the specified output link.
Set Schema from Columns. Specify whether or not the columns specified on the Outputs tab are used for generating the output schema. An output schema is not required if the eventual destination stage is another SAS stage (there can be intermediate stages such as Data Set or Copy stages).


Options Category
Disable Working Directory Warning
Disables the warning message generated by the stage when you omit the Working Directory property. By default, if you omit the Working Directory property, the SAS working directory is indeterminate and the stage generates a warning message.
Convert Local
Specify that the conversion phase of the SAS stage (from the input data set format to the stage SAS data set format) should run on the same nodes as the SAS stage. If this option is not set, the conversion runs by default with the previous stage's degree of parallelism and, if possible, on the same nodes as the previous stage.
Debug Program
A setting of Yes causes the stage to ignore errors in the SAS program and continue execution of the application. This allows your application to generate output even if an SAS step has an error. By default, the setting is No, which causes the stage to abort when it detects an error in the SAS program. Setting the property as Verbose is the same as Yes, but in addition it causes the operator to echo the SAS source code executed by the stage.
SAS List File Location Type
Specifying File for this property causes the stage to write the SAS list file generated by the executed SAS code to a plain text file located in the project directory. The list is sorted before being written out. The name of the list file, which cannot be modified, is dsident.lst, where ident is the name of the stage, including an index in parentheses if there are more than one with the same name. For example, dssas(1).lst is the list file from the second SAS stage in a data flow. Specifying Job Log causes the list to be written to the DataStage job log. Specifying Output causes the list file to be written to an output data set of the stage. The data set from a parallel SAS stage containing the list information will not be sorted. If you specify None no list will be generated.


SAS Log File Location Type
Specifying File for this property causes the stage to write the SAS log file generated by the executed SAS code to a plain text file located in the project directory. The name of the log file, which cannot be modified, is dsident.log, where ident is the name of the stage, including an index in parentheses if there are more than one with the same name. For example, dssas(1).log is the log file from the second SAS stage in a data flow. Specifying Job Log causes the log to be written to the DataStage job log. Specifying Output causes the log file to be written to an output data set of the stage. If you specify None no log will be generated.
SAS Options
Specify any options for the SAS code in a quoted string. These are the options that you would specify to an SAS OPTIONS directive.
Working Directory
Name of the working directory on all the processing nodes executing the SAS application. All relative pathnames in the SAS code are relative to this pathname.

Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.


Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Link Ordering Tab


This tab allows you to specify how input links and output links are numbered. This is important when you are specifying Input Link Number and Output Link Number properties.

By default the first link added will be link 1, the second link 2 and so on. Select a link and use the arrow buttons to change its position.

NLS Map
The NLS Map tab allows you to define a character set map for the SAS stage. This overrides the default character set map set for the project or the job. You can specify that the map be supplied as a job parameter if required.

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. There can be multiple inputs to the SAS stage. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being passed to the SAS code. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about SAS stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before being passed to the SAS code. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the SAS stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
Whether the SAS stage is set to execute in parallel or sequential mode.
Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the SAS stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the SAS stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the SAS stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.


Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
(Auto). This is the default collection method for SAS stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being passed to the SAS code. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default auto methods). Select the check boxes as follows:
Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.


Outputs Page
The Outputs page allows you to specify details about data output from the SAS stage. The SAS stage can have multiple output links. Choose the link whose details you are viewing from the Output Name drop-down list. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the SAS stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about SAS stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For the SAS stage the Mapping tab allows you to specify how the output columns are derived and how SAS data maps onto them.

The left pane shows the data output from the SAS code. These are read only and cannot be modified on this tab. The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived.


You can fill it in by dragging input columns over, or by using the Automatch facility.


39
Generic Stage
The Generic stage is a processing stage. It has any number of input links and any number of output links. The Generic stage allows you to call an Orchestrate operator from within a DataStage stage and pass it options as required.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Must Dos
DataStage has many defaults, which means that it can be very easy to include Generic stages in a job. This section specifies the minimum steps to take to get a Generic stage functioning. DataStage provides a versatile user interface with many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Generic stage: In the Stage Page Properties Tab:

Specify the name of the Orchestrate operator the stage will call. Specify the name of each option the operator requires, and set its value. This can be repeated to specify multiple options.

In the Stage Page Link Ordering Tab, order your input and output links so they correspond to the required link number.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property       Values                 Default   Mandatory?   Repeats?   Dependent of
Options/Operator        Orchestrate operator   N/A       Y            N          N/A
Options/Option name     String                 N/A       N            Y          N/A
Options/Option Value    String                 N/A       N            N          Option name


Options Category
Operator
Specify the name of the Orchestrate operator the stage will call.
Option name
Specify the name of an option the operator requires. This has a dependent property:
Option Value
The value the option is to be set to.
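As a purely hypothetical illustration (the operator and option names here are invented and would have to be replaced by a real Orchestrate operator and the options it actually accepts), to call an operator named myop with an option named limit set to 100, you would set Operator to myop, add an Option name of limit, and give it an Option Value of 100.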

Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering Tab


This tab allows you to specify how input and output links correspond to link labels.

To rearrange the links, choose an output link and click the up arrow button or the down arrow button.

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Generic stage can accept multiple incoming data sets. Select the link whose details you are looking at from the Input name drop-down list. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being operated on. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Generic stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is operated on. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Generic stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
Whether the Generic stage is set to execute in parallel or sequential mode.
Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Generic stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Generic stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method of the Generic stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
(Auto). This is the default collection method for Generic stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being operated on. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows:
Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.


Outputs Page
The Outputs page allows you to specify details about data output from the Generic stage. The Generic stage can have any number of output links. Select the link whose details you are looking at from the Output name drop-down list. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output links. See Chapter 3, "Stage Editors," for a general description of these tabs.


40
Surrogate Key Stage
The Surrogate Key stage is a processing stage. It can have a single input and a single output. The Surrogate Key stage generates key columns for an existing data set. You can specify certain characteristics of the key sequence. The stage generates sequentially incrementing unique integers from a given starting point. The existing columns of the data set are passed straight through the stage.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the input data. Outputs Page. This is where you specify details about the data being output from the stage.


Key Space
If the stage is operating in parallel, each node will increment the key by the number of partitions being written to. The basic operation is illustrated below.
With a start value of 0 and four partitions, the key is incremented by four, so each partition writes every fourth key value:
Partition 0 (Node A): 0, 4, 8, 12
Partition 1 (Node B): 1, 5, 9, 13
Partition 2 (Node C): 2, 6, 10, 14
Partition 3 (Node D): 3, 7, 11, 15
40-2

Parallel Job Developers Guide

Surrogate Key Stage

Key Space

If, however, your partitions are not balanced, you may end up with holes in your key space:
Partition 0 (Node A): 0, 4, 8
Partition 1 (Node B): 1, 5, 9, 13
Partition 2 (Node C): 2, 6
Partition 3 (Node D): 3, 7, 11, 15
Keys 10, 12, and 14 are not used.

To guarantee that there are no holes in the key space (i.e., all available keys are used) the incoming data set partitions should be perfectly balanced. This can be achieved using the round robin partitioning method where your starting point is sequential (i.e., non-partitioned) data. Note that, if the Surrogate Key stage (or other preceding stage) repartitions already partitioned data for some reason, then a hole-free keyspace cannot be guaranteed, whatever method the repartitioning uses. The following illustrates what happens when four balanced partitions are repartitioned into three using the round robin method. Each of the original partitions is repartitioned independently, and each one starts with the first of the new partitions. This results in partitions that are near balanced rather than perfectly balanced, and holes in the keyspace.
Data repartitioned from four partitions to three partitions:
Keys generated in new partition 0: 0, 3, 6, 9, 12, 15, 18, 21
Keys generated in new partition 1: 1, 4, 7, 10
Keys generated in new partition 2: 2, 5, 8, 11
Keys 13, 14, 16, 17, 19, and 20 are missing from the key space.

Examples
This section gives examples of input and output data from a Surrogate Key stage to give you a better idea of how the stage works.


In this example the input data set is as follows:

The stage adds two surrogate key columns called surr_key1 and surr_key2. A unique value for surr_key1 and surr_key2 is generated for each row of input data. You have to give DataStage information about how to generate the surrogates. This is done on the Stage page Properties Tab. For this example, you specify:


The output data set will be:

Must Dos
DataStage has many defaults, which means that it can be very easy to include Surrogate Key stages in a job. This section specifies the minimum steps to take to get a Surrogate Key stage functioning. DataStage provides a versatile user interface with many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Surrogate Key stage: In the Stage Page Properties Tab, under the Keys category:

Specify the column that will contain the surrogate key (choose an output column if you have defined output column meta data, or type in a name). Specify the type of the surrogate key. This is one of 16-bit, 32-bit, or 64-bit integer. Specify a start number for the key. This is 0 by default. You can also specify a job parameter so that the starting number can be supplied at run time.

In the Output Page Mapping Tab check that the input columns map onto the output columns as you expect. The mapping is carried out according to what you specified in the Properties Tab.


Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties that determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property          Values                                          Default          Mandatory?   Repeats?   Dependent of
Keys/Surrogate Key Name    string                                          N/A              Y            Y          N/A
Keys/Output Type           16-bit integer/32-bit integer/64-bit integer    32-bit integer   Y            N          Surrogate Key Name
Keys/Start Value           number                                          0                Y            N          Surrogate Key Name

Keys Category
Surrogate Key Name
Specify the name of the surrogate key to be generated. You can select an existing output column, or type in a name. You can repeat this property to specify multiple keys. It has the following dependent properties:
Output Type
Specify the column type of the new column. Choose from:

16-bit integer
32-bit integer
64-bit integer

The default is 32-bit integer.


Start Value
Specify the initial value for the key generation. It defaults to 0. You can also specify a job parameter, so the start value can be supplied at run time.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Surrogate Key stage expects one incoming data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being imported. The Columns tab specifies

the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Surrogate Key stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is imported. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Surrogate Key stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Surrogate Key stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Surrogate Key stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type dropdown list. This will override any current partitioning. If the Surrogate Key stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Surrogate Key stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Surrogate Key stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being imported. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Surrogate Key stage. The Surrogate Key stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. Note that the key field will be selected for the output columns carrying the generated keys. The Mapping tab allows you to specify the relationship between the columns being input to the Surrogate Key stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output links. Details about Surrogate Key stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For the Surrogate Key stage the Mapping tab allows you to specify how the output columns are derived.

The left pane shows the columns on the input link plus any surrogate that the stage is generating. The right pane shows the output columns. In the example the columns have been mapped straight across.

41
Column Import Stage
The Column Import stage is a restructure stage. It can have a single input link, a single output link and a single rejects link. The complement to this stage is the Column Export stage, described in Chapter 42. The Column Import stage imports data from a single column and outputs it to one or more columns. You would typically use it to divide data arriving in a single column into multiple columns. The data would be fixed-width or delimited in some way to tell the Column Import stage where to make the divisions. The input column must contain string or binary data; the output columns can be any data type. You supply an import table definition to specify the target columns and their types. This also determines the order in which data from the import column is written to output columns. Information about the format of the incoming column (e.g., how it is delimited) is given in the Format tab of the Outputs page. You can optionally save reject records, that is, records whose import was rejected, and write them to a rejects link. In addition to importing a column you can also pass other columns straight through the stage. So, for example, you could pass a key column straight through.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Examples
This section gives examples of input and output data from a Column Import stage to give you a better idea of how the stage works. In this example the Column Import stage extracts data from a 16-byte raw data field into four integer output fields. The input data set also contains a column which is passed straight through the stage. The

example assumes that the job is running sequentially. The screenshot shows the column definitions for the input data set.

The following are the rows from the input data set:

The import table definition can either be supplied on the Outputs Page Columns Tab or in a schema file. For the example, the definition would be:

You have to give DataStage information about how to treat the imported data to split it into the required columns. This is done on the Outputs page Format Tab. For this example, you specify a data format of binary to ensure that the contents of col_to_import are interpreted as binary integers, and that the data has a field delimiter of none.
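The conversion itself happens inside the stage, but a rough Python sketch of the same operation may make the example concrete. The output column names (col1 to col4), the sample key value, and the use of native byte order are assumptions for illustration; the actual byte order follows the Byte order property on the Format tab.

    import struct

    # Illustration: split a 16-byte raw field into four 32-bit integers, as the
    # Column Import stage does here with Data Format = binary and no field delimiter.
    # '=' uses native byte order, mirroring the native-endian default.
    def import_row(col_to_import, key):
        col1, col2, col3, col4 = struct.unpack("=iiii", col_to_import)
        # The key column is passed straight through the stage unchanged.
        return {"col1": col1, "col2": col2, "col3": col3, "col4": col4, "key": key}

    raw = struct.pack("=iiii", 10, 20, 30, 40)      # a sample 16-byte value
    print(import_row(raw, key="A001"))
    # {'col1': 10, 'col2': 20, 'col3': 30, 'col4': 40, 'key': 'A001'}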

The properties of the Column Import stage are set as follows:

The output data set will be:

Must Dos
DataStage has many defaults, which means that it can be very easy to include Column Import stages in a job. This section specifies the minimum steps to take to get a Column Import stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.

To use a Column Import stage: In the Stage Page Properties Tab, under the Input category:

Specify the column that you are importing.

Under the Output category:

Choose the Column method; this is Explicit by default, meaning you explicitly choose output columns as destinations. The alternative is to specify a schema file. If you are using the Explicit method, choose the output column(s) to carry your imported input column. Repeat the Column to Import property to specify all the columns you need. If you are using the Schema File method, specify the schema file that gives the output column details.

In the Output Page Format Tab specify the format of the column you are importing. This informs DataStage about data format and enables it to divide a single column into multiple columns. In the Output Page Mapping Tab check that the input columns are mapping onto the output columns as you expect. The mapping is carried out according to what you specified in the Properties Tab.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property          Values                        Default   Mandatory?                           Repeats?  Dependent of
Input/Import Input Column  Input Column                  N/A       Y                                    N         N/A
Output/Column Method       Explicit/Schema File          Explicit  Y                                    N         N/A
Output/Column to Import    Output Column                 N/A       Y (if Column Method = Explicit)      Y         N/A
Output/Schema File         Pathname                      N/A       Y (if Column Method = Schema file)   N         N/A
Options/Keep Input Column  True/False                    False     N                                    N         N/A
Options/Reject Mode        Continue (warn)/Output/Fail   Continue  N                                    N         N/A

Input Category
Import Input Column
Specifies the name of the column containing the string or binary data to import.

Output Category
Column Method
Specifies whether the columns to import should be derived from column definitions on the Output page Columns tab (Explicit) or from a schema file (Schema File).
Column to Import
Specifies an output column. The meta data for this column determines the type that the import column will be converted to. Repeat the property to specify multiple columns. You can use the Column Selection dialog box to select multiple columns at once if required (see page 3-10). You can specify the properties for each column using the Parallel tab of the Edit Column Metadata dialog box (accessible from the shortcut menu on the columns grid of the output Columns tab). The order of the Columns to Import that you specify should match the order on the Columns tab.

Schema File
Instead of specifying the source data type details via output column definitions, you can use a schema file (note, however, that if you have defined columns on the Columns tab, you should ensure these match the schema file). Type in a pathname or browse for a schema file.

Options Category
Keep Input Column
Specifies whether the original input column should be transferred to the output data set unchanged in addition to being imported and converted. Defaults to False.
Reject Mode
The values of this property specify the following actions: Fail. The stage fails when it encounters a record whose import is rejected. Output. The stage continues when it encounters a reject record and writes the record to the reject link. Continue. The stage is to continue but report failures to the log file.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning.

Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Column Import stage expects one incoming data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being imported. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Column Import stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is imported. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Column Import stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Column Import stage is set to execute in parallel or sequential mode.

Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Column Import stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type dropdown list. This will override any current partitioning. If the Column Import stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Column Import stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Column Import stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being imported. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Column Import stage. The Column Import stage can have only one output link, but can also have a reject link carrying records that have been rejected. The General tab allows you to specify an optional description of the output link. The Format tab allows you to specify details about how data in the column you are importing is formatted so the stage can divide it into separate columns. The Columns tab specifies the

column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Column Import stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output links. Details about Column Import stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Format Tab
The Format tab allows you to supply information about the format of the column you are importing. You use it in the same way as you would to describe the format of a flat file you were reading. The tab has a similar format to the Properties tab and is described in detail on page 3-44. Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to add window. You can then set a value for that property in the Property Value box. Pop up help for each of the available properties appears if you hover the mouse pointer over it. Any property that you set on this tab can be overridden at the column level by setting properties for individual columns on the Edit Column Metadata dialog box (see page 3-26). This description uses the terms record and row and field and column interchangeably. The following sections list the Property types and properties available for each type. Record level These properties define details about how data records are formatted in the flat file. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are: Fill char. Does not apply to output links. Final delimiter string. Specify the string written after the last column of a record in place of the column delimiter. Enter one or more characters; this precedes the record delimiter if one is used. Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to comma (see under "Field Defaults" for Delimiter) and Final delimiter string to ', ' (comma followed by space; you do not need to enter the quotes), all fields are delimited

by a comma, except the final field, which is delimited by a comma followed by an ASCII space character. DataStage skips the specified delimiter string when reading the file. Final delimiter. Specify the single character written after the last column of a record in place of the field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. DataStage skips the specified delimiter string when reading the file. See the following diagram for an illustration.

whitespace. The last column of each record will not include any trailing white spaces found at the end of the record. end. The last column of each record does not include the field delimiter. This is the default setting. none. The last column of each record does not have a delimiter, used for fixed-width fields. null. The last column of each record is delimited by the ASCII null character. comma. The last column of each record is delimited by the ASCII comma character. tab. The last column of each record is delimited by the ASCII tab character.
[Figure: two record layouts, each with comma field delimiters and a newline (nl) record delimiter. With Final Delimiter = end, the last field is followed directly by the record delimiter; with Final Delimiter = comma, a comma follows the last field before the record delimiter.]

Intact. The intact property specifies an identifier of a partial schema. A partial schema specifies that only the column(s) named in the schema can be modified by the stage. All other columns in the row are passed through unmodified. (See "Partial Schemas" in Appendix A for details.) The file containing the partial schema is specified in the Schema File property on the Outputs tab. This property has a dependent property:

Check intact. Select this to force validation of the partial schema as the file or files are imported. Note that this can degrade performance.

Record delimiter string. Specify the string at the end of each record. Enter one or more characters. This is mutually exclusive with Record delimiter, which is the default, and record type and record prefix.

Record delimiter. Specify the single character at the end of each record. Type a character or select one of the following:

UNIX Newline (the default)
null

(To specify a DOS newline, use the Record delimiter string property set to \R\N or choose Format as DOS line terminator from the shortcut menu.) Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and record type. Record length. Select Fixed where fixed length fields are being read. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes. This is not used by default (default files are comma-delimited). Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by default. This is mutually exclusive with Record delimiter, which is the default, and record delimiter string and record type. Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or VR. This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix and by default is not used. Field Defaults Defines default properties for columns read from the file or files. These are applied to all columns, but can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are: Actual field length. Specifies the actual number of bytes to skip if the field's length equals the setting of the null field length property.

Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select one of whitespace, end, none, null, comma, or tab. DataStage skips the delimiter when reading.

whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of the column. end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the same as a setting of None which is used for fields with fixed-width columns. none. No delimiter (used for fixed-width). null. ASCII Null character is used. comma. ASCII comma character is used. tab. ASCII tab character is used.

Delimiter string. Specify the string at the end of each field. Enter one or more characters. This is mutually exclusive with Delimiter, which is the default. For example, specifying ', ' (comma followed by space; you do not need to enter the quotes) specifies that each field is delimited by ', ' unless overridden for individual fields. DataStage skips the delimiter string when reading. Null field length. The length in bytes of a variable-length field that contains a null. When a variable-length field is read, a length of null field length in the source field indicates that it contains a null. This property is mutually exclusive with null field value. Null field value. Specifies the value given to a null field if the source is set to null. Can be a number, string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values. This property is mutually exclusive with Null field length and Actual length. For a fixed width data representation, you can use Pad char (from the general section of Type defaults) to specify a repeated trailing character if the value you specify is shorter than the fixed width of the field. Prefix bytes. You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. DataStage reads the length prefix but does not include the prefix as a separate field in the data set it reads from the file. This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which are used by default.
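As an illustration of the length-prefix idea (not the stage's own code), the following Python sketch consumes a 2-byte prefix to find the field length and then reads that many bytes; the prefix itself never appears as a separate column. Little-endian prefixes are an assumption here; the real layout follows the Byte order property described under Type Defaults below.

    import struct

    # Illustration: read a variable-length field preceded by a 2-byte length prefix.
    # The prefix gives the field length and is not kept as a separate column.
    def read_prefixed_field(buf, offset=0, prefix_bytes=2):
        fmt = {1: "<B", 2: "<H", 4: "<I"}[prefix_bytes]      # assumed little-endian
        (length,) = struct.unpack_from(fmt, buf, offset)
        start = offset + prefix_bytes
        return buf[start:start + length], start + length     # field bytes, next offset

    data = struct.pack("<H", 5) + b"hello" + struct.pack("<H", 3) + b"abc"
    field1, pos = read_prefixed_field(data)
    field2, pos = read_prefixed_field(data, pos)
    print(field1, field2)    # b'hello' b'abc'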

Print field. This property is intended for use when debugging jobs. Set it to have DataStage produce a message for every field it reads. The message has the format:
Importing N: D

where:

N is the field name. D is the imported data of the field. Non-printable characters contained in D are prefixed with an escape character and written as C string literals; if the field contains binary data, it is output in octal format.

Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another character or pair of characters. Choose Single or Double, or enter a character. This is set to double quotes by default. When reading, DataStage ignores the leading quote character and reads all bytes up to but not including the trailing quote character. Vector prefix. For fields that are variable length vectors, specifies that a 1-, 2-, or 4-byte prefix contains the number of elements in the vector. You can override this default prefix for individual vectors. Variable-length vectors must use either a prefix on the vector or a link to another field in order to specify the number of elements in the vector. If the variable length vector has a prefix, you use this property to indicate the prefix length. DataStage reads the length prefix but does not include it as a separate field in the data set. By default, the prefix length is assumed to be one byte. Type Defaults These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type. General These properties apply to several data types (unless overridden at column level): Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

little-endian. The high byte is on the right. big-endian. The high byte is on the left. native-endian. As defined by the native format of the machine. This is the default.
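A small Python illustration of the three orderings (purely for reference; native-endian simply follows whatever byte order the machine itself uses):

    import struct, sys

    value = 0x0A0B0C0D
    print(struct.pack("<I", value).hex())   # little-endian: 0d0c0b0a (high byte ends up on the right)
    print(struct.pack(">I", value).hex())   # big-endian:    0a0b0c0d (high byte on the left)
    print(struct.pack("=I", value).hex())   # native-endian: whichever of the above the machine uses
    print(sys.byteorder)                    # 'little' or 'big'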
Data Format. Specifies the data representation format of a field. Applies to fields of all data types except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that is neither string nor raw. Choose from:

binary
text (the default)

A setting of binary has different meanings when applied to different data types:

For decimals, binary means packed. For other numerical data types, binary means "not text". For dates, binary is equivalent to specifying the julian property for the date field. For time, binary is equivalent to midnight_seconds. For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.

By default data is formatted as text, as follows:

For the date data type, text specifies that the data read contains a text-based date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the decimal data type: a field represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. The destination string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored. For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): DataStage assumes that numeric fields are represented as text. For the time data type: text specifies that the field represents time in the text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

(See page 2-28 for a description of data types.)
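To make the binary timestamp layout described above concrete, the sketch below decodes two 32-bit integers as a Julian day count plus seconds from midnight. This is an illustration only; the Julian-day offset constant and the native byte order are assumptions, not values taken from the product.

    import struct
    from datetime import date, time

    JULIAN_DAY_OFFSET = 1721425   # assumed: Julian Day Number = date.toordinal() + 1721425

    # Illustration: decode a binary timestamp, i.e. two 32-bit integers holding a
    # Julian day count and the number of seconds from midnight.
    def decode_binary_timestamp(raw):
        julian_day, seconds = struct.unpack("=ii", raw)      # native byte order assumed
        d = date.fromordinal(julian_day - JULIAN_DAY_OFFSET)
        t = time(seconds // 3600, (seconds % 3600) // 60, seconds % 60)
        return d, t

    raw = struct.pack("=ii", 2451545, 12 * 3600 + 34 * 60 + 56)
    print(decode_binary_timestamp(raw))   # (datetime.date(2000, 1, 1), datetime.time(12, 34, 56))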

Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type. Field width. The number of bytes in a field represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it's a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type. If you specify neither field width nor field max width, numeric fields written as text have the following number of bytes as their maximum width:

8-bit signed or unsigned integers: 4 bytes 16-bit signed or unsigned integers: 6 bytes 32-bit signed or unsigned integers: 11 bytes 64-bit signed or unsigned integers: 21 bytes single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent) double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)

Pad char. This property is ignored for output links. Character set. Specifies the character set. Choose from ASCII or EBCDIC. The default is ASCII. Applies to all data types except raw and ustring and record, subrec, or tagged containing no fields other than raw or ustring. String These properties are applied to columns with a string data type, unless overridden at column level. Export EBCDIC as ASCII. Not relevant for output links. Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters. For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see DataStage Developer's Help.

Decimal These properties are applied to columns with a decimal data type unless overridden at column level. Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No. The default is No. Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default). Packed. Select an option to specify what the decimal columns contain, choose from:

Yes to specify that the decimal fields contain data in packed decimal format (the default). This has the following subproperties: Check. Select Yes to verify that data is packed, or No to not verify. Signed. Select Yes to use the existing sign when reading decimal fields. Select No to write a positive sign (0xf) regardless of the field's actual sign value.

No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the following sub-property: Sign Position. Choose leading or trailing as appropriate.

No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This has the following subproperty: Sign Position. Choose leading or trailing as appropriate.

No (overpunch) to specify that the field has a leading or end byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This has the following subproperty: Sign Position. Choose leading or trailing as appropriate.
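Packed decimal stores two digits per byte with the sign carried in the low nibble of the final byte (0xD negative, 0xC or 0xF positive). The following Python sketch decodes one such value purely to illustrate the layout; it is not the stage's code and ignores the Check and Allow all zeros options.

    # Illustration: decode a packed decimal value (two digits per byte, sign in the
    # low nibble of the last byte: 0xD = negative, 0xC or 0xF = positive).
    def unpack_decimal(data, scale=0):
        nibbles = []
        for b in data:
            nibbles.extend((b >> 4, b & 0x0F))
        sign = nibbles.pop()                  # the final nibble is the sign
        value = int("".join(str(d) for d in nibbles))
        if sign == 0x0D:
            value = -value
        return value / 10 ** scale if scale else value

    print(unpack_decimal(b"\x12\x3C"))                # 123
    print(unpack_decimal(b"\x01\x23\x4D", scale=2))   # -12.34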

Precision. Specifies the precision of a packed decimal. Enter a number. Rounding. Specifies how to round the source field to fit into the destination decimal when reading a source field to a decimal. Choose from:

up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE 754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1.

down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE 754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2. nearest value. Round the source column towards the nearest representable value. This mode corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4 becomes -1, -1.5 becomes -2. truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1.
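The four modes correspond closely to Python's decimal rounding constants, which gives a convenient way to check expected results. This is an illustration only, not part of the product:

    from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP, ROUND_DOWN

    modes = {
        "up (ceiling)":          ROUND_CEILING,
        "down (floor)":          ROUND_FLOOR,
        "nearest value":         ROUND_HALF_UP,    # halves round away from zero, like COBOL ROUNDED
        "truncate towards zero": ROUND_DOWN,
    }
    for name, mode in modes.items():
        rounded = [int(Decimal(v).quantize(Decimal("1"), rounding=mode))
                   for v in ("1.4", "1.6", "-1.4", "-1.6")]
        print(name, rounded)
    # up (ceiling)          [2, 2, -1, -1]
    # down (floor)          [1, 1, -2, -2]
    # nearest value         [1, 2, -1, -2]
    # truncate towards zero [1, 1, -1, -1]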

Scale. Specifies the scale of a source packed decimal. Numeric These properties apply to integer and float fields unless overridden at column level. C_format. Perform non-default conversion of data from string data to an integer or floating-point. This property specifies a C-language format string used for reading integer or floating point strings. This is passed to sscanf(). For example, specifying a C_format of %x and a field width of 8 ensures that a 32-bit integer is formatted as an 8-byte hexadecimal string. In_format. Format string used for conversion of data from string to integer or floating-point data. This is passed to sscanf(). By default, DataStage invokes the C sscanf() function to convert a numeric field formatted as a string to either integer or floating point data. If this function does not output data in a satisfactory format, you can specify the in_format property to pass formatting arguments to sscanf(). Out_format. This property is not relevant for output links. Date These properties are applied to columns with a date data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text. Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

Format string. The string format of a date. By default this is %yyyy-%mm-%dd. The Format string can contain one or a combination of the following elements:

%dd: A two-digit day. %mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff, for example %1970yy. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 366). %mmm: Three-character month abbreviation.

The format_string is subject to the following restrictions:


It cannot have more than one element of the same type, for example it cannot contain two %dd elements. It cannot have both %dd and %ddd. It cannot have both %yy and %yyyy. It cannot have both %mm and %ddd. It cannot have both %mmm and %ddd. It cannot have both %mm and %mmm. If it has %dd, it must have %mm or %mmm. It must have exactly one of %yy or %yyyy.

When you specify a date format string, prefix each component with the percent symbol (%). Separate the strings components with any character except the percent sign (%). If this format string does not include a day, it is set to the first of the month in the destination field. If the format string does not include the month and day, they default to January 1. Note that the format string must contain a month if it also contains a day; that is, you cannot omit only the month. The year_cutoff is the year defining the beginning of the century in which all two digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can also set this using the environment variable APT_DATE_CENTURY_BREAK_YEAR (see "APT_DATE_CENTURY_BREAK_YEAR" in Parallel Job Advanced Developers Guide), but this is overridden by %year_cutoffyy if you have set it.

You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029. Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT. Time These properties are applied to columns with a time data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text. Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss. The possible components of the time format string are:

%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol. Separate the strings components with any character except the percent sign (%). Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight. Timestamp These properties are applied to columns with a timestamp data type unless overridden at column level. Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss. Specify the format as follows: For the date:

%dd: A two-digit day.

%mm: A two-digit month. %year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff. %yy: A two-digit year derived from a year cutoff of 1900. %yyyy: A four-digit year. %ddd: Day of year in three-digit form (range of 1 - 366).

For the time:


%hh: A two-digit hours component. %nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date). %ss: A two-digit seconds component. %ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol (%). Separate the string's components with any character except the percent sign (%).
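The year cutoff rule described earlier for %year_cutoffyy (a two-digit year resolves to the first year ending in those digits that is the same as or greater than the cutoff) can be written in a few lines of Python. This is just an illustration of the rule, not product code:

    # Illustration of the %year_cutoffyy rule: a two-digit year resolves to the
    # first year ending in those digits that is the same as or greater than the cutoff.
    def resolve_two_digit_year(yy, year_cutoff=1900):
        year = (year_cutoff // 100) * 100 + yy
        if year < year_cutoff:
            year += 100
        return year

    print(resolve_two_digit_year(97))                    # 1997 (default cutoff of 1900)
    print(resolve_two_digit_year(30, year_cutoff=1930))  # 1930
    print(resolve_two_digit_year(29, year_cutoff=1930))  # 2029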

Mapping Tab
For the Column Import stage the Mapping tab allows you to specify how the output columns are derived.

The left pane shows the columns the stage is deriving from the single imported column. These are read only and cannot be modified on this tab.
The right pane shows the output columns for each link. In the example the stage has automatically mapped the specified Columns to Import onto the output columns. The Key column is an extra input column and is automatically passed through the stage. Because the Keep Input Column property was set to True, the original column (comp_col in this example) is available to map onto an output column. We recommend that you maintain the automatic mappings of the generated columns when using this stage.

Reject Link
You cannot change the details of a Reject link. The link uses the column definitions for the link rejecting the data records.

Using RCP With Column Import Stages


Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. So such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. Columns you are importing do not have inherent column definitions, and so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on Column Import stages if you have used the Schema File property (see "Schema File" on page 41-8) to specify a schema which describes all the columns in the column. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:
Sequential File
File Set
External Source
External Target
Column Import
Column Export


42
Column Export Stage
The Column Export stage is a restructure stage. It can have a single input link, a single output link and a single rejects link. The Column Export stage exports data from a number of columns of different data types into a single column of data type string or binary. It is the complementary stage to Column Import (see Chapter 41). The input data column definitions determine the order in which the columns are exported to the single output column. Information about how the single column being exported is delimited is given in the Formats tab of the Inputs page. You can optionally save reject records, that is, records whose export was rejected. In addition to exporting a column you can also pass other columns straight through the stage. So, for example, you could pass a key column straight through.

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage.

Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Examples
This section gives examples of input and output data from a Column Export stage to give you a better idea of how the stage works. In this example the Column Export stage extracts data from three input columns and outputs two of them in a single column of type string and passes the other through. The example assumes that the job is running sequentially. The screenshot shows the column definitions for the input data set.

The following are the rows from the input data set:

The import table definition is supplied on the Outputs Page Columns Tab. For our example, the definition would be:

You have to give DataStage information about how to delimit the exported data when it combines it into a single column. This is done on the Inputs page Format Tab. For this example, you specify a data format of text, a Field Delimiter of comma, and a Quote type of double.
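In rough terms, with those settings each exported field is written as quoted text and the fields are separated by commas. The Python sketch below imitates that for one row; the column names and the exact quoting details are assumptions for illustration, not part of the product.

    # Illustration: export two input columns into one string column
    # (Data Format = text, Field Delimiter = comma, Quote = double),
    # passing a third column through unchanged. Column names are made up.
    def export_row(row, columns_to_export, export_output_column="exported_col"):
        joined = ",".join('"%s"' % row[c] for c in columns_to_export)
        out = {k: v for k, v in row.items() if k not in columns_to_export}
        out[export_output_column] = joined
        return out

    row = {"customer": "Smith", "balance": 1234, "key": "A001"}
    print(export_row(row, ["customer", "balance"]))
    # {'key': 'A001', 'exported_col': '"Smith","1234"'}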

The Properties of the Column Export stage are set as follows:

The output data set will be:

Must Dos
DataStage has many defaults, which means that it can be very easy to include Column Export stages in a job. This section specifies the minimum steps to take to get a Column Export stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product. To use a Column Export stage:

In the Stage Page Properties Tab, under the Input category:

Choose the Column method; this is Explicit by default, meaning you explicitly choose input columns as sources. The alternative is to specify a schema file. If you are using the Explicit method, choose the input column(s) that you are exporting. Repeat the Column to Export property to specify all the columns you need. If you are using the Schema File method, specify the schema file that gives the column details.

Under the Output category:

Choose the Export Column Type. This is Binary by default, but you can also choose VarChar. This specifies the format of the column you are exporting to. Specify the column you are exporting to in the Export Output Column property.

In the Input Page Format Tab specify the format of the column you are exporting. This informs DataStage about delimiters and enables it to combine multiple columns into a single column with delimiters. In the Output Page Mapping Tab check that the input columns are mapping onto the output columns as you expect. The mapping is carried out according to what you specified in the Properties Tab.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property             Values                   Default   Mandatory?  Repeats?  Dependent of
Options/Export Output Column  Output Column            N/A       Y           N         N/A
Options/Export Column Type    Binary/VarChar           Binary    N           N         N/A
Options/Reject Mode           Continue (warn)/Output   Continue  N           N         N/A
Options/Column to Export      Input Column             N/A       N           Y         N/A
Options/Schema File           Pathname                 N/A       N           N         N/A

Options Category
Export Output Column
Specifies the name of the single column to which the input column or columns are exported.
Export Column Type
Specify either binary or VarChar (string).
Reject Mode
The values of this property specify the following actions: Output. The stage continues when it encounters a reject record and writes the record to the rejects link. Continue(warn). The stage is to continue but report failures to the log file.
Column to Export
Specifies an input column the stage extracts data from. The format properties for this column can be set on the Format tab of the Inputs page. Repeat the property to specify multiple input columns. You can use the Column Selection dialog box to select multiple columns at once if required (see page 3-10). The order of the Columns to Export that you specify should match the order on the Columns tab. If it does not, the order on the Columns tab overrides the order of the properties.

Schema File
Instead of specifying the source data details via input column definitions, you can use a schema file (note, however, that if you have defined columns on the Columns tab, you should ensure these match the schema file). Type in a pathname or browse for a schema file.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Column Export stage expects one incoming data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being exported. The Format tab allows you
to specify details about how data in the column you are exporting will be formatted. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Column Export stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is exported. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Column Export stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Column Export stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Column Export stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type dropdown list. This will override any current partitioning. If the Column Export stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Column Export stage. Entire. Each file written to receives the entire data set.


Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Column Export stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being exported. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default Auto methods). Select the check boxes as follows:


Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Format Tab
The Format tab allows you to supply information about the format of the column you are exporting. You use it in the same way as you would to describe the format of a flat file you were writing. The tab has a similar format to the Properties tab and is described in detail on page 3-44. Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to add window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it. This description uses the terms record and row and field and column interchangeably. The following sections list the Property types and properties available for each type.

Record level
These properties define details about how data records are formatted in the flat file. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are:

Fill char. Specify an ASCII character or a value in the range 0 to 255. You can also choose Space or Null from a drop-down list. This character is used to fill any gaps in a written record caused by

column positioning properties. Set to 0 by default (which is the NULL character). For example, to set it to space you could also type in the space character or enter 32. Note that this value is restricted to one byte, so you cannot specify a multi-byte Unicode character.

Final delimiter string. Specify a string to be written after the last column of a record in place of the column delimiter. Enter one or more characters, this precedes the record delimiter if one is used. Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to comma (see under "Field Defaults" for Delimiter) and Final delimiter string to ", " (comma space; you do not need to enter the inverted commas) all fields are delimited by a comma, except the final field, which is delimited by a comma followed by an ASCII space character.

Final delimiter. Specify a single character to be written after the last column of a record in place of the field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. See the following diagram for an illustration.

whitespace. The last column of each record will not include any trailing white spaces found at the end of the record.
end. The last column of each record does not include the field delimiter. This is the default setting.
none. The last column of each record does not have a delimiter; used for fixed-width fields.
null. The last column of each record is delimited by the ASCII null character.
comma. The last column of each record is delimited by the ASCII comma character.
tab. The last column of each record is delimited by the ASCII tab character.
[Diagram: two sample records, each consisting of several fields separated by commas and ending with a newline (nl) record delimiter. With Final Delimiter = end, the last field is followed directly by the record delimiter; with Final Delimiter = comma, the last field is followed by a comma and then the record delimiter.]

When writing, a space is now inserted after every field except the last in the record. Previously, a space was inserted after every field including the last. (If you want to revert to the pre-release 7.5 behavior of inserting a space after the last field, set the APT_FINAL_DELIM_COMPATIBLE environment variable.)


Intact. The intact property specifies an identifier of a partial schema. A partial schema specifies that only the column(s) named in the schema can be modified by the stage. All other columns in the row are passed through unmodified. (See "Partial Schemas" in Appendix A for details.) The file containing the partial schema is specified in the Schema File property on the Properties tab (see page 5-9). This property has a dependent property, Check intact, but this is not relevant to input links.

Record delimiter string. Specify a string to be written at the end of each record. Enter one or more characters. This is mutually exclusive with Record delimiter, which is the default, record type and record prefix.

Record delimiter. Specify a single character to be written at the end of each record. Type a character or select one of the following:

UNIX Newline (the default) null

(To implement a DOS newline, use the Record delimiter string property set to \R\N or choose Format as DOS line terminator from the shortcut menu.) Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record type.

Record length. Select Fixed where fixed length fields are being written. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes. This is not used by default (default files are comma-delimited). The record is padded to the specified length with either zeros or the fill character if one has been specified.

Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by default. This is mutually exclusive with Record delimiter, which is the default, and record delimiter string and record type.

Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or VR. This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix and by default is not used.
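These record-level settings correspond to record-level format properties of the underlying schema. As an illustrative sketch only (the property values are chosen arbitrarily and the column list is omitted), a comma-delimited record terminated by a UNIX newline could be described in a schema as:

   record {record_delim='\n', delim=','}
   ( ... )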


Field Defaults
Defines default properties for columns written to the file or files. These are applied to all columns written, but can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are:

Actual field length. Specifies the number of bytes to fill with the Fill character when a field is identified as null. When DataStage identifies a null field, it will write a field of this length full of Fill characters. This is mutually exclusive with Null field value.

Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select one of whitespace, end, none, null, comma, or tab.

whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of the column. end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the same as a setting of None which is used for fields with fixed-width columns. none. No delimiter (used for fixed-width). null. ASCII Null character is used. comma. ASCII comma character is used. tab. ASCII tab character is used.

Delimiter string. Specify a string to be written at the end of each field. Enter one or more characters. This is mutually exclusive with Delimiter, which is the default. For example, specifying ", " (comma space; you do not need to enter the inverted commas) would have each field delimited by ", " unless overridden for individual fields.

Null field length. The length in bytes of a variable-length field that contains a null. When a variable-length field is written, DataStage writes a length value of null field length if the field contains a null. This property is mutually exclusive with null field value.

Null field value. Specifies the value written to null field if the source is set to null. Can be a number, string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values. This property is mutually exclusive with Null field length and Actual length. For a fixed width data representation, you can use
Pad char (from the general section of Type defaults) to specify a repeated trailing character if the value you specify is shorter than the fixed width of the field.

Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column's length or the tag value for a tagged field. You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. DataStage inserts the prefix before each field. This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which are used by default.

Print field. This property is not relevant for input links.

Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another character or pair of characters. Choose Single or Double, or enter a character. This is set to double quotes by default. When writing, DataStage inserts the leading quote character, the data, and a trailing quote character. Quote characters are not counted as part of a field's length.

Vector prefix. For fields that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector. You can override this default prefix for individual vectors. Variable-length vectors must use either a prefix on the vector or a link to another field in order to specify the number of elements in the vector. If the variable length vector has a prefix, you use this property to indicate the prefix length. DataStage inserts the element count as a prefix of each variable-length vector field. By default, the prefix length is assumed to be one byte.

Type Defaults
These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General
These properties apply to several data types (unless overridden at column level):

Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:


little-endian. The high byte is on the right. big-endian. The high byte is on the left. native-endian. As defined by the native format of the machine. This is the default.

Data Format. Specifies the data representation format of a field. Applies to fields of all data types except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that is neither string nor raw. Choose from:

binary text (the default)

A setting of binary has different meanings when applied to different data types:

For decimals, binary means packed. For other numerical data types, binary means not text. For dates, binary is equivalent to specifying the julian property for the date field. For time, binary is equivalent to midnight_seconds. For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.

By default data is formatted as text, as follows:

For the date data type, text specifies that the data to be written contains a text-based date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide). For the decimal data type: a field represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. The destination string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored. For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): DataStage assumes that numeric fields are represented as text. For the time data type: text specifies that the field represents time in the text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).


For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

(See page 2-28 for a description of data types.)

Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using a variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.

Field width. The number of bytes in a field represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the number of bytes exactly. If it's a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type. If you specify neither field width nor field max width, numeric fields written as text have the following number of bytes as their maximum width:

8-bit signed or unsigned integers: 4 bytes
16-bit signed or unsigned integers: 6 bytes
32-bit signed or unsigned integers: 11 bytes
64-bit signed or unsigned integers: 21 bytes
single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent)
double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)

Pad char. Specifies the pad character used when strings or numeric values are written to an external string representation. Enter a character (single-byte for strings, can be multi-byte for ustrings) or choose null or space. The pad character is used when the external string representation is larger than required to hold the written field. In this case, the external string is filled with the pad character to its full length. Space is the default. Applies to string, ustring, and numeric data types and record, subrec, or tagged types if they contain at least one field of this type.


Character set. Specifies the character set. Choose from ASCII or EBCDIC. The default is ASCII. Applies to all data types except raw and ustring and record, subrec, or tagged containing no fields other than raw or ustring.

String
These properties are applied to columns with a string data type, unless overridden at column level.

Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters. Applies to fields of the string data type and record, subrec, or tagged fields if they contain at least one field of this type.

Import ASCII as EBCDIC. Not relevant for input links. For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see DataStage Developer's Help.

Decimal
These properties are applied to columns with a decimal data type unless overridden at column level.

Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No. The default is No.

Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default).

Packed. Select an option to specify what the decimal columns contain, choose from:

Yes to specify that the decimal columns contain data in packed decimal format (the default). This has the following subproperties:

Check. Select Yes to verify that data is packed, or No to not verify. Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the columns actual sign value.

No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the following sub-property:

Sign Position. Choose leading or trailing as appropriate.

No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This has the following subproperty:


Sign Position. Choose leading or trailing as appropriate.

No (overpunch) to specify that the field has a leading or end byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This has the following subproperty:

Sign Position. Choose leading or trailing as appropriate.

Precision. Specifies the precision where a decimal column is written in text format. Enter a number. When a decimal is written to a string representation, DataStage uses the precision and scale defined for the source decimal field to determine the length of the destination string. The precision and scale properties override this default. When they are defined, DataStage truncates or pads the source decimal to fit the size of the destination string. If you have also specified the field width property, DataStage truncates or pads the source decimal to fit the size specified by field width.

Rounding. Specifies how to round a decimal column when writing it. Choose from:

up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE 754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1.
down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE 754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.
nearest value. Round the source column towards the nearest representable value. This mode corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4 becomes -1, -1.5 becomes -2.
truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1.

Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination. By default, when DataStage writes a source decimal to a string representation, it uses the precision and scale defined for the source decimal field to determine the length of the destination string. You can override the default by means of the precision and scale properties. When you do, DataStage truncates or pads the


source decimal to fit the size of the destination string. If you have also specified the field width property, DataStage truncates or pads the source decimal to fit the size specified by field width.

Numeric
These properties apply to integer and float fields unless overridden at column level.

C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf(). For example, specifying a C_format of %x and a field width of 8 ensures that integers are written as 8-byte hexadecimal strings.

In_format. This property is not relevant for input links.

Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf(). By default, DataStage invokes the C sprintf() function to convert a numeric field formatted as either integer or floating point data to a string. If this function does not output data in a satisfactory format, you can specify the out_format property to pass formatting arguments to sprintf().

Date
These properties are applied to columns with a date data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.

Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system (see NLS Guide).

Format string. The string format of a date. By default this is %yyyy-%mm-%dd. The Format string can contain one or a combination of the following elements:

%dd: A two-digit day.
%mm: A two-digit month.
%year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff, for example %1970yy.
%yy: A two-digit year derived from a year cutoff of 1900.
%yyyy: A four-digit year.
%ddd: Day of year in three-digit form (range of 1 - 366).
%mmm: Three-character month abbreviation.
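For example, for a sample date of 23 September 2004 (an arbitrary illustrative value), the following format strings produce the output shown:

   %yyyy-%mm-%dd      2004-09-23
   %dd/%mm/%yyyy      23/09/2004
   %dd-%mmm-%1970yy   23-Sep-04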


The format_string is subject to the following restrictions:


It cannot have more than one element of the same type, for example it cannot contain two %dd elements.
It cannot have both %dd and %ddd.
It cannot have both %yy and %yyyy.
It cannot have both %mm and %ddd.
It cannot have both %mmm and %ddd.
It cannot have both %mm and %mmm.
If it has %dd, it must have %mm or %mmm.
It must have exactly one of %yy or %yyyy.

When you specify a date format string, prefix each component with the percent symbol (%). Separate the string's components with any character except the percent sign (%). If this format string does not include a day, it is set to the first of the month in the destination field. If the format string does not include the month and day, they default to January 1. Note that the format string must contain a month if it also contains a day; that is, you cannot omit only the month. The year_cutoff is the year defining the beginning of the century in which all two digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can also set this using the environment variable APT_DATE_CENTURY_BREAK_YEAR (see "APT_DATE_CENTURY_BREAK_YEAR" in Parallel Job Advanced Developer's Guide), but this is overridden by %year_cutoffyy if you have set it. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029.

Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.

Time
These properties are applied to columns with a time data type unless overridden at column level. All of these are incompatible with a Data Format setting of Text.

Format string. Specifies the format of columns representing time as a string. By default this is %hh:%nn:%ss. The possible components of the time format string are:

%hh: A two-digit hours component.
%nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date).
%ss: A two-digit seconds component.
%ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%).

Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp
These properties are applied to columns with a timestamp data type unless overridden at column level.

Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss. Specify the format as follows:

For the date:

%dd: A two-digit day.
%mm: A two-digit month.
%year_cutoffyy: A two-digit year derived from yy and the specified four-digit year cutoff.
%yy: A two-digit year derived from a year cutoff of 1900.
%yyyy: A four-digit year.
%ddd: Day of year in three-digit form (range of 1 - 366).

For the time:


%hh: A two-digit hours component.
%nn: A two-digit minute component (nn represents minutes because mm is used for the month of a date).
%ss: A two-digit seconds component.


%ss.n: A two-digit seconds plus fractional part, where n is the number of fractional digits with a maximum value of 6. If n is 0, no decimal point is printed as part of the seconds component. Trailing zeros are not suppressed.

You must prefix each component of the format string with the percent sign (%). Separate the string's components with any character except the percent sign (%).
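For example, a timestamp value (the value itself is an arbitrary illustration) written with the default format string %yyyy-%mm-%dd %hh:%nn:%ss appears as:

   2004-09-23 14:05:30

while the format string %dd/%mm/%yyyy %hh:%nn:%ss.2 writes the same instant as:

   23/09/2004 14:05:30.00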

Outputs Page
The Outputs page allows you to specify details about data output from the Column Export stage. The Column Export stage can have only one output link, but can also have a reject link carrying records that have been rejected. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Column Export stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output links. Details about Column Export stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Mapping Tab
For the Column Export stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns plus the composite column that the stage exports the specified input columns to. These are read only and cannot be modified on this tab. The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. In the example, the Key column is being passed straight through (it has not been defined as a Column to Export in the stage properties). The remaining columns are all being exported to comp_col, which is the specified Export Column. You could also pass the original columns through the stage, if required.

Reject Link
You cannot change the details of a Reject link. The link uses the column definitions for the link rejecting the data rows. Rows will be rejected if they do not match the expected schema.


Using RCP With Column Export Stages


Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. So such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. You can only use RCP on Column Export stages if you have used the Schema File property (see "Schema File" on page 42-7) to specify a schema which describes all the columns in the data. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:
Sequential File
File Set
External Source
External Target
Column Import
Column Export


43
Make Subrecord Stage
The Make Subrecord stage is a restructure stage. It can have a single input link and a single output link. The Make Subrecord stage combines specified vectors in an input data set into a vector of subrecords whose columns have the names and data types of the original vectors. You specify the vector columns to be made into a vector of subrecords and the name of the new subrecord. See "Complex Data Types" on page 2-32 for an explanation of vectors and subrecords.
[Diagram: Input Data consists of four columns, each carrying a vector (Vector 1 to Vector 4) with elements indexed from 0; Vector 2 has one more element than the others. Output Data consists of a single column carrying a vector of subrecords: element n of the subrecord vector holds element n of each original vector (Vector1.n, Vector2.n, Vector3.n, Vector4.n), with Pad values where a source vector has no element at that index.]
The Split Subrecord stage performs the inverse operation. See Chapter 44, "Split Subrecord Stage." The length of the subrecord vector created by this operator equals the length of the longest vector column from which it is created. If a variable-length vector column was used in subrecord creation, the subrecord vector is also of variable length.
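In schema terms, the operation can be sketched roughly as follows (the column names, types, and vector lengths are hypothetical, and this is an illustrative sketch of the schema notation rather than a definitive definition):

   Input:  record ( keycol: int32; v1[4]: int32; v2[4]: string[10]; )
   Output: record ( keycol: int32; parent[4]: subrec ( v1: int32; v2: string[10]; ) )

Here parent is the Subrecord Output Column, v1 and v2 are the columns named by the Vector Column for Subrecord property, and keycol is simply passed through.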


Vectors that are smaller than the largest combined vector are padded with default values: NULL for nullable columns and the corresponding type-dependent value for non-nullable columns. When the Make Subrecord stage encounters mismatched vector lengths, it warns you by writing to the job log. You can also use the stage to make a simple subrecord rather than a vector of subrecords. If your input columns are simple data types rather than vectors, they will be used to build a vector of subrecords of length 1, effectively a simple subrecord.
[Diagram: Input Data consists of five simple columns (Keycol and Colname1 to Colname4); Output Data consists of a single column carrying one subrecord whose elements are Keycol and Colname1 to Colname4.]

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.


Examples
This section gives examples of input and output data from a Make Subrecord stage to give you a better idea of how the stage works. In this example the Make Subrecord stage extracts data from four input columns, three of which carry vectors. The data is output in two columns, one carrying the vectors in a subrecord, and the non-vector column being passed through the stage. The example assumes that the job is running sequentially. The screenshot shows the column definitions for the input data set.

The following are the rows from the input data set (superscripts represents the vector index): row row row row row row row row row row Key A B C D E F G H I J acol 12013142643 2206142213 760 0152223 406181203 20416283 180815283 1201016213 120 0162433 30 0172823 50 0172023 bcol ccol Wills0wombat1bill2william3 D0 0 1 Robin0Dally1Rob2RD3 G0A1 0 1 2 3 Beth Betany Bethany Bets B071 Heathcliff0HC1Hchop2Horror3 A011 0 1 2 3 Chaz Swot Chazlet Twerp C0H1 kayser0Cuddles1KB2Ibn Kayeed3 M011 Jayne0Jane1J2JD3 F021 Ann0Anne1AK2AJK3 H0E1 0 1 2 3 Kath Cath Catherine Katy C0H1 Rupert0Rupe1Woopert2puss3 B0C1

The stage outputs the subrecord it builds from the input data in a single column called parent. The column called key will be output


separately. The following screenshot shows the output column definitions:

The Properties of the Make Subrecord stage are set as follows:

The output data set will be: Key Vector Index 0 12 Wills D 22 Robin Parent 1 13 wombat 0 6 Dally 2 4 bill pad 4 Rob 4 64 william pad 21 RD

row

row


row

row

row

row

row

row

row

row

G 76 Beth B 4 Heathcliff A 2 Chaz C 18 Kayser M 12 Jayne F 12 Ann H 3 Kath C 5 Rupert B

A 0 Betany 7 6 HC 1 4 Swot H 8 Cuddles 1 10 Jane 2 0 Anne E 0 Cath H 0 Rupe C

pad pad 52 2 Bethany Bets pad pad 81 0 Hchop Horror pad pad 6 8 Chazlet Twerp pad pad 5 8 KB Ibn Kayeed pad pad 6 1 J JD pad pad 6 43 AK AJK pad pad 7 82 Catherine Katy pad pad 7 02 Woopert puss pad pad

Must Dos
DataStage has many defaults, which means that it can be very easy to include Make Subrecord stages in a job. This section specifies the minimum steps to take to get a Make Subrecord stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are when you get familiar with the product. To use a Make Subrecord stage: In the Stage Page Properties Tab, under the Input category:

Specify the vector column to combine into the subrecord, repeat the property to specify multiple vector columns.

Under the Output category:

Specify the Subrecord output column.


Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties that determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property                           Values         Default  Mandatory?  Repeats?  Dependent of
Options/Subrecord Output Column             Output Column  N/A      Y           N         N/A
Options/Vector Column for Subrecord         Input Column   N/A      N           Y         Key
Options/Disable Warning of Column Padding   True/False     False    N           N         N/A

Input Category
Subrecord Output Column
Specify the name of the subrecord into which you want to combine the columns specified by the Vector Column for Subrecord property.

Output Category
Vector Column for Subrecord
Specify the name of the column to include in the subrecord. You can specify multiple columns to be combined into a subrecord. For each column, specify the property followed by the name of the column to include. You can use the Column Selection dialog box to select multiple columns at once if required (see page 3-10).


Options Category
Disable Warning of Column Padding
When the stage combines vectors of unequal length, it pads columns and displays a message to this effect. Optionally specify this property to disable display of the message.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Make Subrecord stage expects one incoming data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Make Subrecord stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. If the Make Subrecord stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Make Subrecord stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Make Subrecord stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type dropdown list. This will override any current partitioning. If the Make Subrecord stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method of the Make Subrecord stage. Entire. Each file written to receives the entire data set.

Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. This is the default partitioning method for the Make Subrecord stage. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button. Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button. The following Collection methods are available: (Auto). This is the default collection method for Make Subrecord stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default Auto methods). Select the check boxes as follows:


Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Make Subrecord stage. The Make Subrecord stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of the tabs.


44
Split Subrecord Stage
The Split Subrecord stage separates an input subrecord field into a set of top-level vector columns. It can have a single input link and a single output link. The stage creates one new vector column for each element of the original subrecord. That is, each top-level vector column that is created has the same number of elements as the subrecord from which it was created. The stage outputs columns of the same name and data type as those of the columns that comprise the subrecord.
[Diagram: Input Data consists of a single column carrying a vector of subrecords; element n of the vector holds Vector1.n, Vector2.n, Vector3.n, and Vector4.n, with Pad values where a source vector has no element at that index. Output Data consists of four columns, each carrying one of the top-level vectors (Vector 1 to Vector 4) with elements indexed from 0.]


The Make Subrecord stage performs the inverse operation (see Chapter 43, "Make Subrecord Stage").
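Sketched in the same illustrative schema notation as the Make Subrecord chapter (the column names, types, and vector lengths are hypothetical), the Split Subrecord stage reverses that transformation:

   Input:  record ( keycol: int32; parent[4]: subrec ( v1: int32; v2: string[10]; ) )
   Output: record ( keycol: int32; v1[4]: int32; v2[4]: string[10]; )

Here parent is the column named by the Subrecord Column property; v1 and v2 become top-level vector columns and keycol is passed through unchanged.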

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Examples
This section gives examples of input and output data from a Split Subrecord stage to give you a better idea of how the stage works. In this example the Split Subrecord stage extracts data from a subrecord containing three vectors. The data is output in four columns, three carrying the vectors from the subrecord, plus another column which is passed through the stage. The example assumes that


the job is running sequentially. The screenshot shows the column definitions for the input data set.

The following are the rows from the input data set (superscripts represents the vector index): Key Vector Index row A Parent 0 1 12 13 Wills wombat D 0 22 6 Robin Dally G A 76 0 Beth Betany B 7 4 6 HeathcliffHC A 1 2 4 Chaz Swot C H 18 8 Kayser Cuddles M 1 12 10 Jayne Jane F 2 12 0 2 4 bill pad 4 Rob pad 52 Bethany pad 81 Hchop pad 6 Chazlet pad 5 KB pad 6 J pad 6 4 64 william pad 21 RD pad 2 Bets pad 0 Horror pad 8 Twerp pad 8 Ibn Kayeed pad 1 JD pad 43

row

row

row

row

row

row

row


row

row

Ann H 3 Kath C 5 Rupert B

Anne E 0 Cath H 0 Rupe C

AK AJK pad pad 7 82 CatherineKaty pad pad 7 02 Woopert puss pad pad

The stage outputs the data it extracts from the subrecord in three separate columns, each carrying a vector. The column called key will be output separately. The following screenshot shows the output column definitions:


The Properties of the Split Subrecord stage are set as follows:

The output data set will be (superscripts represents the vector index): row row row row row row row row row row Key A B C D E F G H I J acol 12013142643 2206142213 760 0152223 406181203 20416283 180815283 1201016213 120 0162433 30 0172823 50 0172023 bcol Wills0wombat1bill2william3 Robin0Dally1Rob2RD3 Beth0Betany1Bethany2Bets3 Heatchcliff0HC1Hchop2Horror3 Chaz0Swot1Chazlet2Twerp3 kayser0Cuddles1KB2Ibn Kayeed3 Jayne0Jane1J2JD3 Ann0Anne1AK2AJK3 Kath0Cath1Catherine2Katy3 Rupert0Rupe1Woopert2puss3 ccol D0 0 1 G0A1 B071 A011 C0H1 M011 F021 H0E1 C0H1 B0C1

Must Dos
DataStage has many defaults, which means that it can be very easy to include Split Subrecord stages in a job. This section specifies the minimum steps to take to get a Split Subrecord stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are when you get familiar with the product. To use a Split Subrecord stage:

Parallel Job Developers Guide

44-5

Stage Page

Split Subrecord Stage

In the Stage Page Properties Tab, under the Options category:

Specify the subrecord column that the stage will extract vectors from.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. The Split Subrecord only has one property, and you must supply a value for this. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property          Values        Default  Mandatory?  Repeats?  Dependent of
Options/Subrecord Column   Input Column  N/A      Y           N         N/A

Options Category
Subrecord Column
Specifies the name of the vector whose elements you want to promote to a set of similarly named top-level columns.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.


Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. There can be only one input to the Split Subrecord stage. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Split Subrecord stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of


current and preceding stages and how many nodes are specified in the Configuration file. If the Split Subrecord stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Split Subrecord stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Split Subrecord stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type dropdown list. This will override any current partitioning. If the Split Subrecord stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Split Subrecord stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button .


Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Split Subrecord stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default Auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
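To make the keyed methods above more concrete, the following Python sketch shows the idea behind Hash, Modulus, and Round Robin partitioning. It is purely illustrative: the partition count, key values, and helper names are invented for this example and are not DataStage code.

```python
# Illustrative sketch only: how Hash, Modulus, and Round Robin partitioning
# might assign rows to partitions. Assumes 4 partitions and an integer key;
# the real DataStage operators are not implemented this way.
NUM_PARTITIONS = 4

def hash_partition(key_value):
    # Hash the key value, then take the remainder modulo the partition count.
    return hash(str(key_value)) % NUM_PARTITIONS

def modulus_partition(key_value):
    # Apply a modulus function directly to a numeric key (e.g. a tag field).
    return key_value % NUM_PARTITIONS

def round_robin_partition(row_number):
    # Ignore the data entirely; deal rows out in arrival order.
    return row_number % NUM_PARTITIONS

keys = [3, 8, 8, 15, 21, 22]
for i, key in enumerate(keys):
    print(key, hash_partition(key), modulus_partition(key), round_robin_partition(i))
```

Note how Modulus sends equal key values to the same partition by construction, while Round Robin ignores the key altogether.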


Outputs Page
The Outputs page allows you to specify details about data output from the Split Subrecord stage. The Split Subrecord stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of these tabs.


45
Combine Records Stage
The Combine Records stage is restructure stage. It can have a single input link and a single output link. The Combine Records stage combines records (i.e., rows), in which particular key-column values are identical, into vectors of subrecords. As input, the stage takes a data set in which one or more columns are chosen as keys. All adjacent records whose key columns contain the same value are gathered into the same record as subrecords.
[Diagram: each input row carries columns Keycol and Colname1 to Colname4; in the output, adjacent rows sharing the same value of Keycol are gathered into a single column (Column 1) as a vector of Subrec subrecords, each holding Keycol and Colname1 to Colname4.]

The data set input to the Combine Records stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. Choosing the (auto) partitioning method will ensure that partitioning and sorting is done. If sorting and partitioning are carried out on separate stages before the Combine Records stage, DataStage in auto mode will detect this and not repartition (alternatively you could explicitly specify the Same partitioning method).
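The restructuring itself can be pictured with a short Python sketch. This is only an illustration of the logic described above, using invented dictionary-based rows and the subreccol column name from the examples later in this chapter; it is not how DataStage implements the stage.

```python
# Illustrative sketch of the Combine Records transformation: adjacent input
# rows with equal key values become one output row whose subrecord column
# holds a vector (here, a list) of subrecords. Not DataStage code.
from itertools import groupby

def combine_records(rows, keys, top_level_keys=False):
    output = []
    # The input must already be key partitioned and sorted, as the stage requires.
    for key_vals, group in groupby(rows, key=lambda r: tuple(r[k] for k in keys)):
        group = list(group)
        if top_level_keys:
            # Keys stay as top-level columns; subrecords carry the non-key columns.
            subrecs = [{c: v for c, v in r.items() if c not in keys} for r in group]
            out_row = dict(zip(keys, key_vals))
            out_row["subreccol"] = subrecs
        else:
            # Whole input rows, keys included, become the subrecord elements.
            out_row = {"subreccol": group}
        output.append(out_row)
    return output

rows = [{"col1": 1, "keycol": "A"},
        {"col1": 3, "keycol": "A"},
        {"col1": 1, "keycol": "B"}]
print(combine_records(rows, keys=["keycol"]))
```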

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Examples
This section gives examples of input and output data from a Combine Records stage to give you a better idea of how the stage works.


Example 1
This example assumes that the job is running sequentially. The screenshot shows the column definitions for the input data set:

The following are some rows from the input data set:

     col1  col2      col3        keycol
row  1     00:11:01  1960-01-02  A
row  3     08:45:54  1946-09-15  A
row  1     12:59:01  1955-12-22  B
row  2     07:33:04  1950-03-10  B
row  2     12:00:00  1967-02-06  B
row  2     07:37:04  1950-03-10  B
row  3     07:56:03  1977-04-14  B
row  3     09:58:02  1960-05-18  B
row  1     11:43:02  1980-06-03  C
row  2     01:30:01  1985-07-07  C
row  2     11:30:01  1985-07-07  C
row  3     10:28:02  1992-11-23  C
row  3     12:27:00  1929-08-11  C
row  3     06:33:03  1999-10-19  C
row  3     11:18:22  1992-11-23  C

Once combined by the stage, each group of rows will be output in a single column called subreccol. This contains the keycol, col1, col2, and col3 columns. (If you do not take advantage of the runtime column propagation feature, you would have to set up the subrecord using the Edit Column Meta Data dialog box to set a level number for each of the columns the subrecord column contains.)

The Properties of the stage are set as follows:

The Output data set will be:

     subreccol
     vector index  col1  col2      col3        keycol
row  0             1     00:11:01  1960-01-02  A
     1             3     08:45:54  1946-09-15  A
row  0             1     12:59:01  1955-12-22  B
     1             2     07:33:04  1950-03-10  B
     2             2     12:00:00  1967-02-06  B
     3             2     07:37:04  1950-03-10  B
     4             3     07:56:03  1977-04-14  B
     5             3     09:58:02  1960-05-18  B
row  0             1     11:43:02  1980-06-03  C
     1             2     01:30:01  1985-07-07  C
     2             2     11:30:01  1985-07-07  C
     3             3     10:28:02  1992-11-23  C
     4             3     12:27:00  1929-08-11  C
     5             3     06:33:03  1999-10-19  C
     6             3     11:18:22  1992-11-23  C

Example 2
This example shows a more complex structure that can be derived using the Top Level Keys Property. This can be set to True to indicate that key columns should be left as top-level columns and not included in the subrecord. This example assumes that the job is running sequentially. The same column definitions are used, except both col1 and keycol are defined as keys:

The same input data set is used:

     col1  col2      col3        keycol
row  1     00:11:01  1960-01-02  A
row  3     08:45:54  1946-09-15  A
row  1     12:59:01  1955-12-22  B
row  2     07:33:04  1950-03-10  B
row  2     12:00:00  1967-02-06  B
row  2     07:37:04  1950-03-10  B
row  3     07:56:03  1977-04-14  B
row  3     09:58:02  1960-05-18  B
row  1     11:43:02  1980-06-03  C
row  2     01:30:01  1985-07-07  C
row  2     11:30:01  1985-07-07  C
row  3     10:28:02  1992-11-23  C
row  3     12:27:00  1929-08-11  C
row  3     06:33:03  1999-10-19  C
row  3     11:18:22  1992-11-23  C

The Output column definitions have two separate columns defined for the keys, as well as the column carrying the subrecords:


The properties of the stage are set as follows:

The Output data set will be:

     keycol  col1  subreccol
                   vector index  col2      col3
row  A       1     0             00:11:01  1960-01-02
row  A       3     0             08:45:54  1946-09-15
row  B       1     0             12:59:01  1955-12-22
row  B       2     0             07:33:04  1950-03-10
                   1             12:00:00  1967-02-06
                   2             07:37:04  1950-03-10
row  B       3     0             07:56:03  1977-04-14
                   1             09:58:02  1960-05-18
row  C       1     0             11:43:02  1980-06-03
row  C       2     0             01:30:01  1985-07-07
                   1             11:30:01  1985-07-07
row  C       3     0             10:28:02  1992-11-23
                   1             12:27:00  1929-08-11
                   2             06:33:03  1999-10-19
                   3             11:18:22  1992-11-23

Must Dos
DataStage has many defaults which means that it can be very easy to include Combine Records stages in a job. This section specifies the minimum steps to take to get a Combine Records stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are when you get familiar with the product. To use a Combine Records stage: In the Stage Page Properties Tab, under the Output category:

Specify the output column to carry the vector of subrecords in the Subrecord Output Column.

Under the Combine Keys category:

Specify the key column. Repeat the property to specify a composite key. All adjacent rows sharing the same value of the keys will be combined into a single row using the vector of subrecords structure.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property                Values         Default  Mandatory?  Repeats?  Dependent of
Options/Subrecord Output Column  Output Column  N/A      Y           N         N/A
Options/Key                      Input Column   N/A      Y           Y         N/A
Options/Case Sensitive           True/False     True     N           N         Key
Options/Top Level Keys           True/False     False    N           N         N/A

Outputs Category
Subrecord Output Column Specify the name of the subrecord that the Combine Records stage creates.

Combine Keys Category


Key Specify one or more columns. You can use the Column Selection dialog box to select multiple columns at once if required (see page 3-10). All records whose key columns contain identical values are gathered into the same record as subrecords. If the Top Level Keys property is set to False, each column becomes the element of a subrecord. If the Top Level Keys property is set to True, the key column appears as a top-level column in the output record as opposed to in the subrecord. All non-key columns belonging to input records with that key column appear as elements of a subrecord in that key column's output record. Key has the following dependent property: Case Sensitive Use this property to specify whether each key is case sensitive or not. It is set to True by default; for example, the values CASE and case would not be judged equivalent.

Options Category
Top Level Keys Specify whether to leave keys as top-level columns or have them put into the subrecord. False by default.
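As a rough illustration (invented values, shown in the same dictionary notation as the sketch earlier in this chapter, not DataStage code), the same group of input rows produces differently shaped output rows depending on this property:

```python
# Shape of one output row for the group keycol = "A" with col1 values 1 and 3;
# illustrative values only, not DataStage code.

# Top Level Keys = False: the key travels inside each subrecord element.
row_keys_false = {"subreccol": [{"keycol": "A", "col1": 1},
                                {"keycol": "A", "col1": 3}]}

# Top Level Keys = True: the key becomes a top-level output column and the
# subrecord elements carry only the non-key columns.
row_keys_true = {"keycol": "A", "subreccol": [{"col1": 1}, {"col1": 3}]}

print(row_keys_false)
print(row_keys_true)
```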


Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Combine Records stage uses this when it is determining the sort order for key columns. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Combine Records stage expects one incoming data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Combine Records stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in


the Configuration file. Auto mode ensures that data being input to the Combine Records stage is hash partitioned and sorted. If the Combine Records stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Combine Records stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Combine Records stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type dropdown list. This will override any current partitioning. If the Combine Records stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Combine Records stage. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button .


Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Combine Records stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. In the case of a Combine Records stage, Auto will also ensure that the collected data is sorted. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for


partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Combine Records stage. The Combine Records stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of the tabs.


46
Promote Subrecord Stage
The Promote Subrecord stage is a restructure stage. It can have a single input link and a single output link. The Promote Subrecord stage promotes the columns of an input subrecord to top-level columns. The number of output columns equals the number of subrecord elements. The data types of the input subrecord columns determine those of the corresponding top-level columns.
[Diagram: the input has a single column (Column 1) holding a subrecord, Parent, with columns Colname1 to Colname4; the output has four top-level columns, Colname1 to Colname4.]


The stage can also promote the columns in vectors of subrecords, in which case it acts as the inverse of the Combine Records stage (see Chapter 45).
[Diagram: the input has a single column (Column 1) holding a vector of Subrec subrecords, each with columns Keycol and Colname1 to Colname4; the output has five top-level columns, Keycol and Colname1 to Colname4.]

The Combine Records stage performs the inverse operation. See Chapter 45, "Combine Records Stage."
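A short Python sketch of the promotion may help. It is illustrative only, reusing the subreccol and keycol names from the examples below rather than showing DataStage's actual implementation.

```python
# Illustrative sketch of the Promote Subrecord transformation: each element of
# the input subrecord (or of a vector of subrecords) becomes one output row
# whose columns are the former subrecord columns. Not DataStage code.
def promote_subrecord(rows, subrecord_column):
    output = []
    for row in rows:
        value = row[subrecord_column]
        # A plain subrecord yields one output row; a vector of subrecords
        # yields one row per element (the inverse of Combine Records).
        elements = value if isinstance(value, list) else [value]
        output.extend(dict(element) for element in elements)
    return output

rows = [{"subreccol": [{"col1": 1, "keycol": "A"}, {"col1": 3, "keycol": "A"}]},
        {"subreccol": [{"col1": 1, "keycol": "B"}]}]
print(promote_subrecord(rows, "subreccol"))
```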

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Examples
This section gives examples of input and output data from a Promote Subrecord stage to give you a better idea of how the stage works.


Example 1
In this example the Promote Subrecord stage promotes the columns of a simple subrecord to top-level columns. It extracts data from a single column containing a subrecord. The data is output in four columns, each carrying a column from the subrecord. The example assumes that the job is running sequentially. The screenshot shows the column definitions for the input data set.

The following are the rows from the input data set:

     Subrec (subrecord column)
     col1  col2  col3   col4
row  1     AAD   Thurs  No
row  2     ABD   Thurs  No
row  3     CAD   Weds   Yes
row  4     CCC   Mon    Yes
row  5     BDD   Mon    Yes
row  6     DAK   Fri    No
row  7     MDB   Tues   Yes

The stage outputs the data it extracts from the subrecord in four separate columns of appropriate type. The following screenshot shows the output column definitions:

The Properties of the Promote Subrecord stage are set as follows:

The output data set will be:

     Col1  Col2  Col3   Col4
row  1     AAD   Thurs  No
row  2     ABD   Thurs  No
row  3     CAD   Weds   Yes
row  4     CCC   Mon    Yes


Example 2
This example shows how the Promote Subrecord would operate on an aggregated vector of subrecords, as would be produced by the Combine Records stage. It assumes that the job is running sequentially. The screenshot shows the column definitions for the input data set:

The following are some rows from the input data set:

     subreccol
     vector index  col1  col2      col3        keycol
row  0             1     00:11:01  1960-01-02  A
     1             3     08:45:54  1946-09-15  A
row  0             1     12:59:01  1955-12-22  B
     1             2     07:33:04  1950-03-10  B
     2             2     12:00:00  1967-02-06  B
     3             2     07:37:04  1950-03-10  B
     4             3     07:56:03  1977-04-14  B
     5             3     09:58:02  1960-05-18  B
row  0             1     11:43:02  1980-06-03  C
     1             2     01:30:01  1985-07-07  C
     2             2     11:30:01  1985-07-07  C
     3             3     10:28:02  1992-11-23  C
     4             3     12:27:00  1929-08-11  C
     5             3     06:33:03  1999-10-19  C
     6             3     11:18:22  1992-11-23  C


Once the columns in the subrecords have been promoted the data will be output in four columns as follows:

The properties of the stage are set as follows:

The Output data set will be:

     col1  col2      col3        keycol
row  1     00:11:01  1960-01-02  A
row  3     08:45:54  1946-09-15  A
row  1     12:59:01  1955-12-22  B
row  2     07:33:04  1950-03-10  B
row  2     12:00:00  1967-02-06  B
row  2     07:37:04  1950-03-10  B
row  3     07:56:03  1977-04-14  B
row  3     09:58:02  1960-05-18  B
row  1     11:43:02  1980-06-03  C
row  2     01:30:01  1985-07-07  C
row  2     11:30:01  1985-07-07  C
row  3     10:28:02  1992-11-23  C
row  3     12:27:00  1929-08-11  C
row  3     06:33:03  1999-10-19  C
row  3     11:18:22  1992-11-23  C

Must Dos
DataStage has many defaults which means that it can be very easy to include Promote Subrecord stages in a job. This section specifies the minimum steps to take to get a Promote Subrecord stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are when you get familiar with the product. To use a Promote Subrecord stage: In the Stage Page Properties Tab, under the Options category:

Specify the subrecord column that the stage will promote subrecords from.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Promote Subrecord Stage has one property:
Category/Property         Values        Default  Mandatory?  Repeats?  Dependent of
Options/Subrecord Column  Input Column  N/A      Y           N         N/A


Options Category
Subrecord Column Specifies the name of the subrecord whose elements will be promoted to top-level columns.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combineability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Promote Subrecord stage expects one incoming data set.


The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Promote Subrecord stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Promote Subrecord stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Promote Subrecord stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Promote Subrecord stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Promote Subrecord stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Promote Subrecord stage. Entire. Each file written to receives the entire data set.


Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list. Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. This is the default partitioning method for the Promote Subrecord stage. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Promote Subrecord stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning methods chosen (it is not available with the default auto methods). Select the check boxes as follows:


Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list. Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Promote Subrecord stage. The Promote Subrecord stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of the tabs.


47
Make Vector Stage
The Make Vector stage is an active stage. It can have a single input link and a single output link. The Make Vector stage combines specified columns of an input data record into a vector of columns. The stage has the following requirements: The input columns must form a numeric sequence, and must all be of the same type. The numbers must increase by one. The columns must be named column_name0 to column_namen, where column_name starts the name of a column and 0 and n are the first and last of its consecutive numbers. The columns do not have to be in consecutive order. All these columns are combined into a vector of the same length as the number of columns (n+1). The vector is called column_name. Any input columns that do not have a name of that form will not be included in the vector but will be output as top level columns.
[Diagram: input columns Col0 to Col4 (Columns 1 to 5) are combined into a single output column, Col, carrying a vector whose elements are Col0 to Col4.]


The Split Vector stage performs the inverse operation. See Chapter 48, "Split Vector Stage."
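The naming rule described above can be illustrated with a small Python sketch. The function below is an invented illustration of the logic, not DataStage code; it assumes dictionary-based rows and picks up every column whose name is the common partial name followed by a number.

```python
# Illustrative sketch of the Make Vector transformation: columns named
# <name>0 .. <name>n are combined, in numeric order, into a vector column
# called <name>; any other columns pass through unchanged. Not DataStage code.
import re

def make_vector(rows, common_partial_name):
    pattern = re.compile(rf"^{re.escape(common_partial_name)}(\d+)$")
    output = []
    for row in rows:
        numbered = {int(m.group(1)): col
                    for col in row if (m := pattern.match(col))}
        # Elements are ordered by their numeric suffix: name0, name1, ...
        vector = [row[numbered[i]] for i in sorted(numbered)]
        out_row = {col: val for col, val in row.items()
                   if col not in numbered.values()}
        out_row[common_partial_name] = vector
        output.append(out_row)
    return output

rows = [{"Name": "Will", "Col0": 3, "Col1": 6, "Col2": 2, "Col3": 9, "Col4": 9}]
print(make_vector(rows, "Col"))   # {'Name': 'Will', 'Col': [3, 6, 2, 9, 9]}
```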

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Examples
This section gives examples of input and output data from a Make Vector stage to give you a better idea of how the stage works.

Example 1
In this example, all the input data will be included in the output vector. The example assumes that the job is running sequentially. The screenshot shows the column definitions for the input data set. Note the columns all have the same type and names in the form column_nameN:

The following are some rows from the input data set:

     Col0  Col1  Col2  Col3  Col4
row  3     6     2     9     9
row  3     2     7     2     4
row  7     8     8     5     3
row  4     8     7     1     6
row  1     6     2     5     1
row  0     1     6     7     8
row  9     9     6     4     2
row  0     8     4     4     3
row  1     7     2     5     3
row  7     9     4     7     8

The stage outputs the vectors it builds from the input data in a single column called column_name. You do not have to explicitly define the output column name, DataStage will do this for you as the job runs, but you may wish to do so to make the job more understandable.

The properties of the stage are set as follows:

The output data set will be:

     Col
     Vector Index:  0  1  2  3  4
row                 3  6  2  9  9
row                 3  2  7  2  4
row                 7  8  8  5  3
row                 4  8  7  1  6
row                 1  6  2  5  1
row                 0  1  6  7  8
row                 9  9  6  4  2
row                 0  8  4  4  3
row                 1  7  2  5  3
row                 7  9  4  7  8

Example 2
In this example, there are additional columns as well as the ones that will be included in the vector. The example assumes that the job is running sequentially. The screenshot shows the column definitions for the input data set, note the additional columns called name and code:

The following are some rows from the input data set:

     Name        Code  Col0  Col1  Col2  Col3  Col4
row  Will        D070  3     6     2     9     9
row  Robin       GA36  3     2     7     2     4
row  Beth        B777  7     8     8     5     3
row  Heathcliff  A100  4     8     7     1     6
row  Chaz        CH01  1     6     2     5     1
row  Kayser      CH02  0     1     6     7     8
row  Jayne       M122  9     9     6     4     2
row  Ann         F234  0     8     4     4     3
row  Kath        HE45  1     7     2     5     3
row  Rupert      BC11  7     9     4     7     8

The stage outputs the vectors it builds from the input data in a single column called column_name. The two other columns are output separately. You do not have to explicitly define the output column names, DataStage will do this for you as the job runs, but you may wish to do so to make the job more understandable.

The properties of the stage are set as follows:

The output data set will be:

                       Col
     Name        Code  Vector Index:  0  1  2  3  4
row  Will        D070                 3  6  2  9  9
row  Robin       GA36                 3  2  7  2  4
row  Beth        B777                 7  8  8  5  3
row  Heathcliff  A100                 4  8  7  1  6
row  Chaz        CH01                 1  6  2  5  1
row  Kayser      CH02                 0  1  6  7  8
row  Jayne       M122                 9  9  6  4  2
row  Ann         F234                 0  8  4  4  3
row  Kath        HE45                 1  7  2  5  3
row  Rupert      BC11                 7  9  4  7  8

Must Dos
DataStage has many defaults which means that it can be very easy to include Make Vector stages in a job. This section specifies the minimum steps to take to get a Make Vector stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are when you get familiar with the product. To use a Make Vector stage: In the Stage Page Properties Tab:

Specify the Columns Common Partial Name. This is the column_name part that is shared by all the columns in the input data set, and it will be the name of the output column containing the vector.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Make Vector stage has one property:
Category/Property                    Values  Default  Mandatory?  Repeats?  Dependent of
Options/Columns Common Partial Name  Name    N/A      Y           N         N/A


Options Category
Columns Common Partial Name Specifies the beginning column_name of the series of consecutively numbered columns column_name0 to column_namen to be combined into a vector called column_name.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Make Vector stage expects one incoming data set.


The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Make Vector stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on. By default the stage partitions in Auto mode. If the Make Vector stage is operating in sequential mode, it will first collect the data using the default Auto collection method. The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on: Whether the Make Vector stage is set to execute in parallel or sequential mode. Whether the preceding stage in the job is set to execute in parallel or sequential mode. If the Make Vector stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Make Vector stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available: (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method. Entire. Each file written to receives the entire data set. Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.


Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields. Random. The records are partitioned randomly, based on the output of a random number generator. Round Robin. The records are partitioned on a round robin basis as they enter the stage. Same. Preserves the partitioning already in place. This is the default partitioning method for the Make Vector stage. DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button . Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button . The following Collection methods are available: (Auto). This is the default collection method for Make Vector stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available. Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list. The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows: Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.


Stable. Select this if you want to preserve previously sorted data sets. This is the default. Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained. If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Make Vector stage. The Make Vector stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of the tabs.


48
Split Vector Stage
The Split Vector stage is a restructure stage. It can have a single input link and a single output link. The Split Vector stage promotes the elements of a fixed-length vector to a set of similarly named top-level columns. The stage creates columns of the format name0 to namen, where name is the original vector's name and 0 and n are the first and last elements of the vector.
[Diagram: the input has a single column, Col, carrying a vector with elements Col0 to Col4; the output has five top-level columns, Col0 to Col4.]

The Make Vector stage performs the inverse operation (see Chapter 47).
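As with the Make Vector stage, the logic can be sketched in a few lines of Python. The function below is an invented illustration of the splitting described above, not DataStage code.

```python
# Illustrative sketch of the Split Vector transformation: a fixed-length vector
# column <name> is split into top-level columns <name>0 .. <name>n, and any
# other columns are passed straight through. Not DataStage code.
def split_vector(rows, vector_column):
    output = []
    for row in rows:
        out_row = {col: val for col, val in row.items() if col != vector_column}
        for i, element in enumerate(row[vector_column]):
            out_row[f"{vector_column}{i}"] = element
        output.append(out_row)
    return output

rows = [{"Name": "Will", "Code": "D070", "Col": [3, 6, 2, 9, 9]}]
print(split_vector(rows, "Col"))
```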


The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is where you specify the details about the single input set from which you are selecting records. Outputs Page. This is where you specify details about the processed data being output from the stage.

Examples
This section gives examples of input and output data from a Split Vector stage to give you a better idea of how the stage works.

Example 1
In this example the input data comprises a single column carrying a vector. The example assumes that the job is running sequentially. The screenshot shows the column definitions for the input data set.

The following are some rows from the input data set:

     Col
     Vector Index:  0  1  2  3  4
row                 3  6  2  9  9
row                 3  2  7  2  4
row                 7  8  8  5  3
row                 4  8  7  1  6
row                 1  6  2  5  1
row                 0  1  6  7  8
row                 9  9  6  4  2
row                 0  8  4  4  3
row                 1  7  2  5  3
row                 7  9  4  7  8

The stage outputs the data it extracts from the vector in a set of similarly named top-level columns, column_name0 to column_namen. You do not have to explicitly define the output column names, DataStage will do this for you as the job runs, but you may wish to do so to make the job more understandable.

The properties of the stage are set as follows:

The output data set will be:


     Col0  Col1  Col2  Col3  Col4
row  3     6     2     9     9
row  3     2     7     2     4
row  7     8     8     5     3
row  4     8     7     1     6
row  1     6     2     5     1
row  0     1     6     7     8
row  9     9     6     4     2
row  0     8     4     4     3
row  1     7     2     5     3
row  7     9     4     7     8

Example 2
In this example, there are additional columns as well as the ones containing the vector. The example assumes that the job is running sequentially. The screenshot shows the column definitions for the input data set, note the additional columns called name and code:

The following are some rows from the input data set:

                       Col
     Name        Code  Vector Index:  0  1  2  3  4
row  Will        D070                 3  6  2  9  9
row  Robin       GA36                 3  2  7  2  4
row  Beth        B777                 7  8  8  5  3
row  Heathcliff  A100                 4  8  7  1  6
row  Chaz        CH01                 1  6  2  5  1
row  Kayser      CH02                 0  1  6  7  8
row  Jayne       M122                 9  9  6  4  2
row  Ann         F234                 0  8  4  4  3
row  Kath        HE45                 1  7  2  5  3
row  Rupert      BC11                 7  9  4  7  8

The stage outputs the data it extracts from the vector in a set of similarly named top-level columns, column_name0 to column_namen. The other columns are passed straight through. You do not have to explicitly define the output column names, DataStage will do this for you as the job runs, but you may wish to do so to make the job more understandable.

The properties of the stage are set as follows:

The output data set will be:


     Name        Code  Col0  Col1  Col2  Col3  Col4
row  Will        D070  3     6     2     9     9
row  Robin       GA36  3     2     7     2     4
row  Beth        B777  7     8     8     5     3
row  Heathcliff  A100  4     8     7     1     6
row  Chaz        CH01  1     6     2     5     1
row  Kayser      CH02  0     1     6     7     8
row  Jayne       M122  9     9     6     4     2
row  Ann         F234  0     8     4     4     3
row  Kath        HE45  1     7     2     5     3
row  Rupert      BC11  7     9     4     7     8

Must Dos
DataStage has many defaults which means that it can be very easy to include Split Vector stages in a job. This section specifies the minimum steps to take to get a Split Vector stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are when you get familiar with the product. To use a Split Vector stage: In the Stage Page Properties Tab:

Specify the name of the input column carrying the vector to be split.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Split Vector stage has one property:
Category/Property      Values  Default  Mandatory?  Repeats?  Dependent of
Options/Vector Column  Name    N/A      Y           N         N/A


Options Category
Vector Column Specifies the name of the vector whose elements you want to promote to a set of similarly named top-level columns.

Advanced Tab
This tab allows you to specify the following: Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node. Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage. Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning. Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file. Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. There can be only one input to the Split Vector stage. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming


data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Split Vector stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Split Vector stage is operating in sequential mode, it will first collect the data using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
- Whether the Split Vector stage is set to execute in parallel or sequential mode.
- Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Split Vector stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Split Vector stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method.
The following partitioning methods are available:
- (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Split Vector stage.
- Entire. Each file written to receives the entire data set.
- Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
- Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
- Random. The records are partitioned randomly, based on the output of a random number generator.
- Round Robin. The records are partitioned on a round robin basis as they enter the stage.
- Same. Preserves the partitioning already in place.
- DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
- Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
- (Auto). This is the default collection method for Split Vector stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
- Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
- Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
- Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows:
- Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
- Stable. Select this if you want to preserve previously sorted data sets. This is the default.
- Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
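As an informal illustration of how the keyed and keyless partitioning methods distribute rows, the following Python sketch (not DataStage syntax, and assuming rows are simple dictionaries with an integer key column) shows hash, modulus, round robin, and random partitioning over a hypothetical four-node configuration.

import random

def hash_partition(rows, key, num_partitions):
    # Rows with the same key value always land in the same partition.
    return [[r for r in rows if hash(r[key]) % num_partitions == p]
            for p in range(num_partitions)]

def modulus_partition(rows, key, num_partitions):
    # Like hash, but uses the numeric key value directly (suited to tag fields).
    return [[r for r in rows if r[key] % num_partitions == p]
            for p in range(num_partitions)]

def round_robin_partition(rows, num_partitions):
    # Rows are dealt out in turn, giving evenly sized partitions.
    parts = [[] for _ in range(num_partitions)]
    for i, r in enumerate(rows):
        parts[i % num_partitions].append(r)
    return parts

def random_partition(rows, num_partitions, seed=0):
    # Each row goes to a randomly chosen partition.
    rng = random.Random(seed)
    parts = [[] for _ in range(num_partitions)]
    for r in rows:
        parts[rng.randrange(num_partitions)].append(r)
    return parts

if __name__ == "__main__":
    data = [{"id": i, "name": f"row{i}"} for i in range(12)]
    for name, parts in [("hash", hash_partition(data, "id", 4)),
                        ("round robin", round_robin_partition(data, 4))]:
        print(name, [len(p) for p in parts])

Hash and modulus keep all rows with the same key value together in one partition, which is why they are suited to key-based processing, while round robin and random simply balance the number of rows across partitions.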

Outputs Page
The Outputs page allows you to specify details about data output from the Split Vector stage. The Split Vector stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the output link. See Chapter 3, "Stage Editors," for a general description of the tabs.


49
Head Stage
The Head Stage is a Development/Debug stage. It can have a single input link and a single output link. It is one of a number of stages that DataStage provides to help you sample data; see also:
- Tail stage, Chapter 50.
- Sample stage, Chapter 51.
- Peek stage, Chapter 52.
The Head Stage selects the first N rows from each partition of an input data set and copies the selected rows to an output data set. You determine which rows are copied by setting properties which allow you to specify:
- The number of rows to copy
- The partition from which the rows are copied
- The location of the rows to copy
- The number of rows to skip before the copying operation begins

This stage is helpful in testing and debugging applications with large data sets. For example, the Partition property lets you see data from a single partition to determine if the data is being partitioned as you want it to be. The Skip property lets you access a certain portion of a data set.


The stage editor has three pages:
- Stage Page. This is always present and is used to specify general information about the stage.
- Inputs Page. This is where you specify the details about the single input set from which you are selecting records.
- Outputs Page. This is where you specify details about the processed data being output from the stage.

Examples
Head Stage Default Behavior
Our input data set comprises details of the inhabitants of Woodstock, Oxfordshire in the seventeenth century, which has previously been hash-partitioned into four partitions. We accept the default setting to sample ten rows from the start of each partition as follows:


After the job is run we get a data set comprising four partitions each containing ten rows. Here is a sample of partition 0 as input to the Head stage, and partition 0 in its entirety as output by the stage:


Skipping Data
In this example we are using the same data, but this time we are only interested in partition 0, and are skipping the first 100 rows before we take our ten rows. The Head stage properties are set as follows:

Here is the data set output by the stage:

Must Dos
DataStage has many defaults, which means that it can be very easy to include Head stages in a job. This section specifies the minimum steps to take to get a Head stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
To use a Head stage:
- In the Stage Page Properties Tab, under the Rows category:
  - Specify the number of rows per partition that you want to copy from the source data set to the target data set. This defaults to ten.
  You can also:
  - Specify that the stage should skip the first N rows per partition.
  - Specify that the stage will output all rows in a partition after the skip.
  - Specify that the stage should output every Nth row.
- Under the Partitions category:
  - Specify that the stage will only output rows from the selected partitions.
- In the Outputs Page Mapping Tab, specify how the headed data maps onto your output columns.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
- Rows/All Rows. Values: True/False. Default: False. Mandatory: N. Repeats: N. Dependent of: N/A.
- Rows/Number of Rows (per Partition). Values: Count. Default: 10. Mandatory: N. Repeats: N. Dependent of: N/A.
- Rows/Period (per Partition). Values: Number. Default: N/A. Mandatory: N. Repeats: N. Dependent of: N/A.
- Rows/Skip (per Partition). Values: Number. Default: N/A. Mandatory: N. Repeats: N. Dependent of: N/A.
- Partitions/All Partitions. Values: Partition Number. Default: N/A. Mandatory: N. Repeats: Y. Dependent of: N/A.
- Partitions/Partition Number. Values: Number. Default: N/A. Mandatory: Y (if All Partitions = False). Repeats: Y. Dependent of: N/A.

Rows Category
All Rows. Copy all input rows to the output data set. You can skip rows before Head performs its copy operation by using the Skip property. The Number of Rows property is not needed if All Rows is true.
Number of Rows (per Partition). Specify the number of rows to copy from each partition of the input data set to the output data set. The default value is 10. The Number of Rows property is not needed if All Rows is true.
Period (per Partition). Copy every Pth record in a partition, where P is the period. You can start the copy operation after records have been skipped by using the Skip property. P must equal or be greater than 1.
Skip (per Partition). Ignore the first number of rows of each partition of the input data set, where number is the number of rows to skip. The default skip count is 0.
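The interaction of these properties can be pictured with a small Python sketch; this is not DataStage code, just an illustration that treats one partition as a list of rows and uses hypothetical parameter names mirroring the properties above.

def head_partition(rows, num_rows=10, skip=0, period=1, all_rows=False):
    """Select rows from one partition the way the Head properties describe:
    skip the first `skip` rows, then take every `period`-th row, stopping
    after `num_rows` rows unless `all_rows` is set."""
    selected = []
    for i, row in enumerate(rows):
        if i < skip:
            continue                      # Skip (per Partition)
        if (i - skip) % period != 0:
            continue                      # Period (per Partition)
        selected.append(row)
        if not all_rows and len(selected) >= num_rows:
            break                         # Number of Rows (per Partition)
    return selected

# Example: first 3 rows after skipping 5, from a 20-row partition.
print(head_partition(list(range(20)), num_rows=3, skip=5))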

Partitions Category
All Partitions. If False, copy records only from the indicated partition, specified by number. By default, the operator copies rows from all partitions.
Partition Number. Specifies particular partitions to perform the Head operation on. You can specify the Partition Number property multiple times to specify multiple partition numbers.

Advanced Tab
This tab allows you to specify the following:
- Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
- Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
- Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
- Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.
- Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Head stage expects one input. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being headed. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Head stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is headed. It also allows you to specify that the data should be sorted before being operated on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. If the Head stage is operating in sequential mode, it will first collect the data using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
- Whether the Head stage is set to execute in parallel or sequential mode.
- Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Head stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Head stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method.
The following partitioning methods are available:
- (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Head stage.
- Entire. Each file written to receives the entire data set.
- Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
- Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
- Random. The records are partitioned randomly, based on the output of a random number generator.
- Round Robin. The records are partitioned on a round robin basis as they enter the stage.
- Same. Preserves the partitioning already in place.
- DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
- Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
- (Auto). This is the default collection method for Head stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
- Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
- Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
- Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being headed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows:
- Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
- Stable. Select this if you want to preserve previously sorted data sets. This is the default.
- Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
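As an illustration of the sort options (again plain Python rather than anything DataStage executes), the sketch below sorts one partition on a key column, relying on a stable sort so that previously ordered rows keep their relative order, and optionally keeps only the first record for each key value, as the Unique option does.

def sort_partition(rows, key, unique=False):
    # Python's sort is stable: rows with equal keys keep their input order,
    # which is what the Stable option requests.
    ordered = sorted(rows, key=lambda r: r[key])
    if not unique:
        return ordered
    deduped, seen = [], set()
    for row in ordered:
        if row[key] not in seen:      # Unique: keep the first record per key
            seen.add(row[key])
            deduped.append(row)
    return deduped

rows = [{"k": 2, "v": "a"}, {"k": 1, "v": "b"}, {"k": 2, "v": "c"}]
print(sort_partition(rows, "k", unique=True))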

Outputs Page
The Outputs page allows you to specify details about data output from the Head stage. The Head stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Head stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about Head stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Mapping Tab
For the Head stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns and/or the generated columns. These are read only and cannot be modified on this tab. The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Automatch facility.


50
Tail Stage
The Tail Stage is a Development/Debug stage. It can have a single input link and a single output link. It is one of a number of stages that DataStage provides to help you sample data; see also:
- Head stage, Chapter 49.
- Sample stage, Chapter 51.
- Peek stage, Chapter 52.
The Tail Stage selects the last N records from each partition of an input data set and copies the selected records to an output data set. You determine which records are copied by setting properties which allow you to specify:
- The number of records to copy
- The partition from which the records are copied
This stage is helpful in testing and debugging applications with large data sets. For example, the Partition property lets you see data from a single partition to determine if the data is being partitioned as you want it to be. The Skip property lets you access a certain portion of a data set.

The stage editor has three pages:
- Stage Page. This is always present and is used to specify general information about the stage.
- Inputs Page. This is where you specify the details about the single input set from which you are selecting records.
- Outputs Page. This is where you specify details about the processed data being output from the stage.

Examples
Our input data set comprises details of the inhabitants of Woodstock, Oxfordshire in the seventeenth century, which has previously been hash-partitioned into four partitions. We accept the default setting to sample ten rows from the end of each partition as follows:


After the job is run we get a data set comprising four partitions each containing ten rows. Here is a sample of partition 0 as input to the Tail stage, and partition 0 in its entirety as output by the stage:

Must Dos
DataStage has many defaults, which means that it can be very easy to include Tail stages in a job. This section specifies the minimum steps to take to get a Tail stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
To use a Tail stage:
- In the Stage Page Properties Tab, under the Rows category:
  - Specify the number of rows per partition that you want to copy from the source data set to the target data set. This defaults to ten.
- Under the Partitions category:
  - Specify that the stage will only output rows from the selected partitions.
- In the Outputs Page Mapping Tab, specify how the tailed data maps onto your output columns.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
- Rows/Number of Rows (per Partition). Values: Count. Default: 10. Mandatory: N. Repeats: N. Dependent of: Key.
- Partitions/All Partitions. Values: Partition Number. Default: N/A. Mandatory: N. Repeats: Y. Dependent of: N/A.
- Partitions/Partition Number. Values: Number. Default: N/A. Mandatory: Y (if All Partitions = False). Repeats: Y. Dependent of: N/A.

Rows Category
Number of Rows (per Partition). Specify the number of rows to copy from each partition of the input data set to the output data set. The default value is 10.

Partitions Category
All Partitions. If False, copy records only from the indicated partition, specified by number. By default, the operator copies records from all partitions.
Partition Number. Specifies particular partitions to perform the Tail operation on. You can specify the Partition Number property multiple times to specify multiple partition numbers.
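The following Python sketch (not DataStage code) illustrates the effect of these properties, assuming the input is a list of partitions where each partition is a list of rows: the last N rows are kept from each partition, optionally only for the partition numbers you specify.

def tail(partitions, num_rows=10, partition_numbers=None):
    """Keep the last `num_rows` rows of each partition; if a list of
    partition numbers is given, other partitions produce no output."""
    result = []
    for p, rows in enumerate(partitions):
        if partition_numbers is not None and p not in partition_numbers:
            result.append([])
        else:
            result.append(rows[-num_rows:])
    return result

parts = [list(range(25)), list(range(100, 140))]
print(tail(parts, num_rows=10, partition_numbers=[0]))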

Advanced Tab
This tab allows you to specify the following:
- Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
- Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
- Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
- Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.
- Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Tail stage expects one input. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being tailed. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Tail stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is tailed. It also allows you to specify that the data should be sorted before being operated on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will warn if it cannot preserve the partitioning of the incoming data. If the Tail stage is operating in sequential mode, it will first collect the data using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
- Whether the Tail stage is set to execute in parallel or sequential mode.
- Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Tail stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Tail stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method.
The following partitioning methods are available:
- (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Tail stage.
- Entire. Each file written to receives the entire data set.
- Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
- Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
- Random. The records are partitioned randomly, based on the output of a random number generator.
- Round Robin. The records are partitioned on a round robin basis as they enter the stage.
- Same. Preserves the partitioning already in place.
- DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
- Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
- (Auto). This is the default collection method for Tail stages. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
- Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
- Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
- Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being tailed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen. Select the check boxes as follows:
- Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
- Stable. Select this if you want to preserve previously sorted data sets. This is the default.
- Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
The Outputs page allows you to specify details about data output from the Tail stage. The Tail stage can have only one output link. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Tail stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about Tail stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Mapping Tab
For the Tail stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns and/or the generated columns. These are read only and cannot be modified on this tab. The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Automatch facility.


51
Sample Stage
The Sample stage is a Development/Debug stage. It can have a single input link and any number of output links when operating in percent mode, or a single input and single output link when operating in period mode. It is one of a number of stages that DataStage provides to help you sample data; see also:
- Head stage, Chapter 49.
- Tail stage, Chapter 50.
- Peek stage, Chapter 52.
The Sample stage samples an input data set. It operates in two modes. In Percent mode, it extracts rows, selecting them by means of a random number generator, and writes a given percentage of these to each output data set. You specify the number of output data sets, the percentage written to each, and a seed value to start the random number generator. You can reproduce a given distribution by repeating the same number of outputs, the percentage, and the seed value.


In Period mode, it extracts every Nth row from each partition, where N is the period, which you supply. In this case all rows will be output to a single data set, so the stage used in this mode can only have a single output link.

For both modes you can specify the maximum number of rows that you want to sample from each partition. The stage editor has three pages:
- Stage Page. This is always present and is used to specify general information about the stage.
- Input Page. This is where you specify details about the data set being sampled.
- Outputs Page. This is where you specify details about the sampled data being output from the stage.

Examples
Sampling in Percent Mode
Our input data set comprises details of the inhabitants of Woodstock, Oxfordshire in the seventeenth century, which has previously been hash-partitioned into four partitions. We are going to take three samples, one of 10%, one of 5%, and one of 15%, and write these to three different data sets. The job to do this is as follows:

In the Stage page Properties tab we specify which percentages are written to which outputs as follows:


We use the Link Ordering tab to specify which outputs relate to which output links:

When we run the job we end up with three data sets of different sizes. This is illustrated by using the Data Set Manager tool to look at the data sets' size and shape:

(Data Set Manager screenshots: the 10 percent sample, the 5 percent sample, and the 15 percent sample.)

Sampling in Period Mode


In this example we are going to extract every twentieth row from each partition, up to a maximum of forty rows (in practice our example data set is not large enough to reach this maximum). In period mode, you are limited to sampling into a single data set. Here is the job that performs the period sample:


In the Stage page Properties tab we specify the period sample as follows:

Because there can only be one output from the stage, we do not need to bother with the Link Ordering tab. When we run the job it produces a sample from each partition. Here is the data sampled from partition 0:

Must Dos
DataStage has many defaults, which means that it can be very easy to include Sample stages in a job. This section specifies the minimum steps to take to get a Sample stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
To use a Sample stage:
- In the Stage Page Properties Tab, choose the sample mode. This is Percent by default, but you can also choose Period.
- If you have chosen Percent, specify the sampling percentage for an output link, and specify the output link number it will be output on (links are numbered from 0). Repeat these properties to specify the percentage for each of your output links.
- If you have chosen the Period mode, specify the Period. This will sample every Nth row in each partition.
- If you have chosen Percent mode, in the Stage Page Link Ordering Tab, specify which of your actual output links corresponds to link 0, link 1, and so on.
- In the Outputs Page Mapping Tab, specify how output columns on each link are derived from the columns of the input data.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which output links are which.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
- Options/Sample Mode. Values: percent/period. Default: percent. Mandatory: Y. Repeats: N. Dependent of: N/A.
- Options/Percent. Values: number. Default: N/A. Mandatory: Y (if Sample Mode = Percent). Repeats: Y. Dependent of: N/A.
- Options/Output Link Number. Values: number. Default: N/A. Mandatory: Y. Repeats: N. Dependent of: Percent.
- Options/Seed. Values: number. Default: N/A. Mandatory: N. Repeats: N. Dependent of: N/A.
- Options/Period (Per Partition). Values: number. Default: N/A. Mandatory: Y (if Sample Mode = Period). Repeats: N. Dependent of: N/A.
- Options/Max Rows Per Partition. Values: number. Default: N/A. Mandatory: N. Repeats: N. Dependent of: N/A.

Options Category
Sample Mode. Specifies the type of sample operation. You can sample on a percentage of input rows (percent), or you can sample the Nth row of every partition (period).
Percent. Specifies the sampling percentage for each output data set when using a Sample Mode of Percent. You can repeat this property to specify different percentages for each output data set. The sum of the percentages specified for all output data sets cannot exceed 100%. You can specify a job parameter if required. Percent has a dependent property:
  Output Link Number. This specifies the output link to which the percentage corresponds. You can specify a job parameter if required.
Seed. This is the number used to initialize the random number generator. You can specify a job parameter if required. This property is only available if Sample Mode is set to percent.
Period (Per Partition). Specifies the period when using a Sample Mode of Period.
Max Rows Per Partition. This specifies the maximum number of rows that will be sampled from each partition.
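A Python sketch (not DataStage syntax) of the two sampling modes may help; it assumes the input is a plain list of rows for percent mode and a list of partitions for period mode, and its percentages, seed, period, and row cap are hypothetical parameters mirroring the properties above.

import random

def sample_percent(rows, percents, seed=42):
    """Distribute rows over len(percents) outputs; each row has the given
    percentage chance of going to the corresponding output. Repeating the
    same percentages and seed reproduces the same distribution."""
    rng = random.Random(seed)
    outputs = [[] for _ in percents]
    for row in rows:
        draw = rng.uniform(0, 100)
        threshold = 0.0
        for link, pct in enumerate(percents):
            threshold += pct
            if draw < threshold:
                outputs[link].append(row)
                break            # rows not under any threshold are dropped
    return outputs

def sample_period(partitions, period, max_rows=None):
    """Take every Nth row from each partition, up to max_rows per partition."""
    sampled = []
    for rows in partitions:
        keep = rows[::period]
        sampled.extend(keep if max_rows is None else keep[:max_rows])
    return sampled

out = sample_percent(list(range(1000)), percents=[10, 5, 15])
print([len(o) for o in out])                       # roughly 100, 50, 150
print(sample_period([list(range(60))], period=20)) # [0, 20, 40]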

Advanced Tab
This tab allows you to specify the following:
- Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
- Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
- Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request the next stage should attempt to maintain the partitioning.
- Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.
- Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

51-10

Parallel Job Developers Guide

Sample Stage

Input Page

Link Ordering Tab


In Percent mode, this tab allows you to specify the order in which the output links are processed. This is how they correspond to the Output Link Number properties on the Properties Tab.

By default the output links will be processed in the order they were added. To rearrange them, choose an output link and click the up arrow button or the down arrow button.

Input Page
The Input page allows you to specify details about the data set being sampled. There is only one input link. The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. Details about Sample stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning on Input Links


The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the sample is performed.
By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job, the stage will warn if it cannot preserve the partitioning of the incoming data. If the Sample stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
- Whether the Sample stage is set to execute in parallel or sequential mode.
- Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Sample stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Sample stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method.
The following partitioning methods are available:
- (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Sample stage.
- Entire. Each file written to receives the entire data set.
- Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
- Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
- Random. The records are partitioned randomly, based on the output of a random number generator.
- Round Robin. The records are partitioned on a round robin basis as they enter the stage.
- Same. Preserves the partitioning already in place.
- DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
- Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
- (Auto). This is the default collection method for the Sample stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
- Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
- Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
- Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the sample is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows:
- Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
- Stable. Select this if you want to preserve previously sorted data sets. This is the default.
- Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Parallel Job Developers Guide

51-13

Outputs Page

Sample Stage

Outputs Page
The Outputs page allows you to specify details about data output from the Sample stage. In Percent mode, the stage can have any number of output links; in Period mode it can only have one output. Choose the link you want to work on from the Output Link drop down list. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of outgoing data. The Mapping tab allows you to specify the relationship between the columns being input to the Sample stage and the output columns. The Advanced tab allows you to change the default buffering settings for the output links. Details about Sample stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For Sample stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them.

The left pane shows the columns of the sampled data. These are read only and cannot be modified on this tab. This shows the meta data from the incoming link. The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Automatch facility.


In the above example the left pane represents the incoming data after the Sample operation has been performed. The right pane represents the data being output by the stage after the Sample operation. In this example the data has been mapped straight across.


52
Peek Stage
The Peek stage is a Development/Debug stage. It can have a single input link and any number of output links. The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets. Like the Head stage (Chapter 49) and the Tail stage (Chapter 50), the Peek stage can be helpful for monitoring the progress of your application or to diagnose a bug in your application.

The stage editor has three pages:
- Stage Page. This is always present and is used to specify general information about the stage.
- Inputs Page. This is where you specify the details about the single input set from which you are selecting records.
- Outputs Page. This is where you specify details about the processed data being output from the stage.


Must Dos
DataStage has many defaults, which means that it can be very easy to include Peek stages in a job. This section specifies the minimum steps to take to get a Peek stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
To use a Peek stage:
- In the Stage Page Properties Tab, check that the default settings are suitable for your requirements.
- In the Stage Page Link Ordering Tab, if you have chosen to output peeked records to a link rather than the job log, choose which output link will carry the peeked records.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
- Rows/All Records (After Skip). Values: True/False. Default: False. Mandatory: N. Repeats: N. Dependent of: N/A.
- Rows/Number of Records (Per Partition). Values: number. Default: 10. Mandatory: Y. Repeats: N. Dependent of: N/A.
- Rows/Period (per Partition). Values: Number. Default: N/A. Mandatory: N. Repeats: N. Dependent of: N/A.
- Rows/Skip (per Partition). Values: Number. Default: N/A. Mandatory: N. Repeats: N. Dependent of: N/A.
- Columns/Peek All Input Columns. Values: True/False. Default: True. Mandatory: Y. Repeats: N. Dependent of: N/A.
- Columns/Input Column to Peek. Values: Input Column. Default: N/A. Mandatory: Y (if Peek All Input Columns = False). Repeats: Y. Dependent of: N/A.
- Partitions/All Partitions. Values: True/False. Default: True. Mandatory: N. Repeats: N. Dependent of: N/A.
- Partitions/Partition Number. Values: number. Default: N/A. Mandatory: Y (if All Partitions = False). Repeats: Y. Dependent of: N/A.
- Options/Peek Records Output Mode. Values: Job Log/Output. Default: Job Log. Mandatory: N. Repeats: N. Dependent of: N/A.
- Options/Show Column Names. Values: True/False. Default: True. Mandatory: N. Repeats: N. Dependent of: N/A.
- Options/Delimiter String. Values: space/nl/tab. Default: space. Mandatory: N. Repeats: N. Dependent of: N/A.

Rows Category
All Records (After Skip). True to print all records from each partition. Set to False by default.
Number of Records (Per Partition). Specifies the number of records to print from each partition. The default is 10.
Period (per Partition). Print every Pth record in a partition, where P is the period. You can start the copy operation after records have been skipped by using the Skip property. P must equal or be greater than 1.
Skip (per Partition). Ignore the first number of rows of each partition of the input data set, where number is the number of rows to skip. The default skip count is 0.


Columns Category
Peek All Input Columns. True by default and prints all the input columns. Set to False to specify that only selected columns will be printed and specify these columns using the Input Column to Peek property.
Input Column to Peek. If you have set Peek All Input Columns to False, use this property to specify a column to be printed. Repeat the property to specify multiple columns.

Partitions Category
All Partitions. Set to True by default. Set to False to specify that only certain partitions should have columns printed, and specify which partitions using the Partition Number property.
Partition Number. If you have set All Partitions to False, use this property to specify which partition you want to print columns from. Repeat the property to specify multiple partitions.

Options Category
Peek Records Output Mode. Specifies whether the output should go to an output column (the Peek Records column) or to the job log.
Show Column Names. If True, causes the stage to print the column name, followed by a colon, followed by the column value. If False, the stage prints only the column value, followed by a space. It is True by default.
Delimiter String. The string to use as a delimiter on columns. Can be space, tab or newline. The default is space.
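To show what the Show Column Names and Delimiter String options control, here is a small Python sketch (not DataStage code) that formats a single record, represented as a hypothetical dictionary, the way a peeked record might be printed.

def format_peek(record, show_column_names=True, delimiter=" "):
    """Format one record for a peek message: 'name:value' pairs when column
    names are shown, otherwise just the values, joined by the delimiter."""
    if show_column_names:
        fields = [f"{name}:{value}" for name, value in record.items()]
    else:
        fields = [str(value) for value in record.values()]
    return delimiter.join(fields)

rec = {"customer": 1201, "name": "Smith", "balance": 42.50}
print(format_peek(rec))   # customer:1201 name:Smith balance:42.5
print(format_peek(rec, show_column_names=False, delimiter="\t"))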


Advanced Tab
This tab allows you to specify the following:
- Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
- Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
- Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
- Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.
- Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering Tab


This tab allows you to specify which output link carries the peek records data set if you have chosen to output the records to a link rather than the job log.

By default the last link added will represent the peek data set. To rearrange the links, choose an output link and click the up arrow button or the down arrow button.

Inputs Page
The Inputs page allows you to specify details about the incoming data sets. The Peek stage expects one incoming data set. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being peeked. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Peek stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is peeked. It also allows you to specify that the data should be sorted before being operated on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will warn if it cannot preserve the partitioning of the incoming data. If the Peek stage is operating in sequential mode, it will first collect the data using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
- Whether the Peek stage is set to execute in parallel or sequential mode.
- Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Peek stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Peek stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method.
The following partitioning methods are available:
- (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method of the Peek stage.
- Entire. Each file written to receives the entire data set.
- Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
- Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
- Random. The records are partitioned randomly, based on the output of a random number generator.
- Round Robin. The records are partitioned on a round robin basis as they enter the stage.
- Same. Preserves the partitioning already in place.
- DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
- Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
- (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Peek stages.
- Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
- Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
- Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being peeked. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default auto methods). Select the check boxes as follows:
- Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
- Stable. Select this if you want to preserve previously sorted data sets. This is the default.
- Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.


Outputs Page
The Outputs page allows you to specify details about data output from the Peek stage. The Peek stage can have any number of output links. Select the link whose details you are looking at from the Output name drop-down list. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Peek stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output links. Details about Peek stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For the Peek stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the columns being peeked. These are read only and cannot be modified on this tab. The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Automatch facility.


53
Row Generator Stage
The Row Generator stage is a Development/Debug stage. It has no input links, and a single output link. The Row Generator stage produces a set of mock data fitting the specified meta data. This is useful where you want to test your job but have no real data available to process. (See also the Column Generator stage which allows you to add extra columns to existing data sets, Chapter 54.) The meta data you specify on the output link determines the columns you are generating.

The stage editor has two pages: Stage Page. This is always present and is used to specify general information about the stage. Outputs Page. This is where you specify details about the generated data being output from the stage.

Examples
Using a Row Generator Stage in Default Mode
In this example we are going to allow the Row Generator stage to generate a data set using default settings for the data types. The only change we make is to ask for 100 rows to be generated, rather than the default ten. We do this in the Outputs Page Properties tab:

We need to tell the stage how many columns are in the generated data set and what type each column has. We do this in the Output page Columns tab:


When we run the job, DataStage generates the following data set:

We can see from this the type of data that is generated by default. For example, for date fields, the first row has January 1st 1960, and this is incremented by one day for each subsequent row. We can specify more details about each data type, if required, to shape the data being generated.
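For illustration only (these particular rows are hypothetical but follow the default rule described above), the date column of the first four rows would contain 1960-01-01, 1960-01-02, 1960-01-03, and 1960-01-04.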

Example of Specifying Data to be Generated


You can specify more details about the type of data being generated from the Edit Column Meta Data dialog box. This is accessed from the Edit row shortcut menu for individual column definitions on the Outputs page Columns tab.

The Edit Column Meta Data dialog box contains different options for each data type. The possible options are described in "Generator" on page 3-40. We can use the Next> and <Previous buttons to go through all our columns.

Using this dialog box we specify the following for the generated data:


string: Algorithm = cycle; seven separate Values (assorted animals).
date: Epoch = 1958-08-18; Type = cycle; Increment = 10
time: Scale factor = 60; Type = cycle; Increment = 1
timestamp: Epoch = 1958-08-18; Scale factor = 60; Type = cycle; Increment = 1
integer: Type = cycle; Initial value = 300; Increment = 10; Limit = 3000
decimal: Percent invalid = 20; Percent zeros = 20; Type = random; Seed = 200
float: Type = cycle; Increment = 10


Here is the data generated by these settings; compare this with the data generated by the default settings.

Example of Generating Data in Parallel


By default the Row Generator stage runs sequentially, generating data in a single partition. You can, however, configure it to run in parallel, and you can use the partition number when you are generating data to, for example, increment a value by the number of partitions. You will also get the Number of Records you specify in each partition (so in our example where we have asked for 100 records, you will get 100 records in each partition rather than 100 records divided between the number of partitions). In this example we are generating a data set comprising two integers. One is generated by cycling, one by random number generation. The cycling integer's initial value is set to the partition number (using the special value part) and its increment is set to the number of partitions (using the special value partcount). This is set in the Edit Column Meta Data dialog box as follows (select column in Columns tab and choose Edit Row from shortcut menu):

The random integer's seed value is set to the partition number, and the limit to the total number of partitions. When we run this job in parallel, on a system with four nodes, the data generated in partition 0 is as follows:
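As an illustration of what to expect, with four partitions the cycling column in partition 0 contains 0, 4, 8, 12, and so on, while partition 1 contains 1, 5, 9, 13, and so on: each partition starts at its own partition number and steps by the partition count, so no value is generated twice across the data set. The values in the random column depend on the partition-specific seed and are bounded by the limit of 4, so they are not listed here.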

Must Dos
DataStage has many defaults which means that it can be very easy to include Row Generator stages in a job. This section specifies the minimum steps to take to get a Row Generator stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are when you get familiar with the product.

To use a Row Generator stage:

In the Stage Page Properties Tab, specify the Number of Records you want to generate.
Specify the meta data for the rows you want to generate. You can do this either in the Output Page Columns Tab, or by specifying a schema file using the Schema File property on the Stage Page Properties Tab.

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes.

Advanced Tab
This tab allows you to specify the following:

Execution Mode. The Generate stage executes in Sequential mode by default. You can select Parallel mode to generate data sets in separate partitions.
Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. This is Propagate by default. If you have an input data set, it adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request the next stage should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

53-8

Parallel Job Developers Guide

Row Generator Stage

Outputs Page

Outputs Page
The Outputs page allows you to specify details about data output from the Row Generator stage. The General tab allows you to specify an optional description of the output link. The Properties tab lets you specify what the stage does. The Columns tab specifies the column definitions of outgoing data. The Advanced tab allows you to change the default buffering settings for the output link.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property            Values     Default   Mandatory?   Repeats?   Dependent of
Options/Number of Records    number     10        Y            N          N/A
Options/Schema File          pathname   N/A       N            N          N/A

Options Category
Number of Records
The number of records you want your generated data set to contain. The default number is 10.

Schema File
By default the stage will take the meta data defined on the output link to base the mock data set on. But you can specify the column definitions in a schema file, if required. You can browse for the schema file or specify a job parameter.
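As an illustration only, a schema file is a plain text file containing a record definition in the parallel engine's schema syntax; the column names and types below are hypothetical, not part of the product:

record (
  id: int32;
  name: string[20];
  dob: date;
)

Each column listed is then generated just as if it had been defined on the Columns tab.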


54
Column Generator Stage
The Column Generator stage is a Development/Debug stage. It can have a single input link and a single output link. The Column Generator stage adds columns to incoming data and generates mock data for these columns for each data row processed. The new data set is then output. (See also the Row Generator stage which allows you to generate complete sets of mock data, Chapter 53.)

The stage editor has three pages: Stage Page. This is always present and is used to specify general information about the stage. Input Page. This is where you specify details about the input link. Outputs Page. This is where you specify details about the generated data being output from the stage.

Example
For our example we are going to generate an extra column for a data set containing a list of seventeenth-century inhabitants of Woodstock, Oxfordshire. The extra column will contain a unique id for each row. Here is the job that will do this:

The columns for the data input to the Column Generator stage are as follows:


We set the Column Generator properties to add an extra column called uniqueid to our data set as follows:

The new column now appears on the Outputs page Mapping tab and can be mapped across to the output link (so it appears on the Outputs page Columns tab):

In this example we select the uniqueid column on the Outputs page Columns tab, then choose Edit Row from the shortcut menu. The Edit Column Meta Data dialog box appears and lets us specify more details about the data that will be generated for the new column.


First we change the type from the default of char to integer. Because we are running the job in parallel, we want to ensure that the id we are generating will be unique across all partitions. To do this we set the initial value to the partition number (using the special value part) and the increment to the number of partitions (using the special value partcount):

When we run the job in parallel on a four-node system the stage will generate the uniqueid column for each row. Here are samples of partition 0 and partition 1 to show how the unique number is generated:
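As an illustration of the pattern (the other columns of each row are omitted here), partition 0 receives uniqueid values 0, 4, 8, 12, and so on, while partition 1 receives 1, 5, 9, 13, and so on. Because each partition starts at its own partition number and increments by the partition count, no id is ever repeated across the four partitions.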

Must Dos
DataStage has many defaults which means that it can be very easy to include Column Generator stages in a job. This section specifies the minimum steps to take to get a Column Generator stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are when you get familiar with the product.

To use a Column Generator stage:

In the Stage Page Properties Tab, specify the Column Method. This is Explicit by default, which means that you should specify the meta data for the columns you want to generate on the Outputs Page Columns Tab. If you use the Explicit method, you also need to specify which of the output link columns you are generating in the Column to Generate property. You can repeat this property to specify multiple columns. If you use the Schema File method, you should specify the schema file.
Ensure you have specified the meta data for the columns you want to add. If you have specified a Column Method of Explicit, you should do this on the Outputs Page Columns Tab. If you have specified a Column Method of Schema File, you should specify a schema file.
In the Outputs Page Mapping Tab, specify how the incoming columns and generated columns map onto the output columns.

Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property             Values                 Default    Mandatory?                           Repeats?   Dependent of
Options/Column Method         Explicit/Schema File   Explicit   Y                                    N          N/A
Options/Column to Generate    output column          N/A        Y (if Column Method = Explicit)      Y          N/A
Options/Schema File           pathname               N/A        Y (if Column Method = Schema File)   N          N/A
Options Category
Column Method
Select Explicit if you are going to specify the column or columns you want the stage to generate data for. Select Schema File if you are supplying a schema file containing the column definitions.

Column to Generate
When you have chosen a column method of Explicit, this property allows you to specify which output columns the stage is generating data for. Repeat the property to specify multiple columns. You can specify the properties for each column using the Parallel tab of the Edit Column Meta Data dialog box (accessible from the shortcut menu on the columns grid of the output Columns tab). You can use the Column Selection dialog box to specify several columns at once if required (see page 3-10).

Schema File
When you have chosen a column method of Schema File, this property allows you to specify the column definitions in a schema file. You can browse for the schema file or specify a job parameter.

Advanced Tab
This tab allows you to specify the following:

Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. This is Propagate by default. If you have an input data set, it adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request the next stage should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Input Page
The Inputs page allows you to specify details about the incoming data set you are adding generated columns to. There is only one input link and this is optional. The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Generate stage partitioning are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Partitioning on Input Links


The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the generate is performed.


By default the stage uses the auto partitioning method. If the Column Generator stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

Whether the Column Generator stage is set to execute in parallel or sequential mode.
Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Column Generator stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the Column Generator stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Column Generator stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.


Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

(Auto). This is the default collection method for the Column Generator stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operation starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the column generate operation is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows:

Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Outputs Page
Details about Column Generator stage mapping are given in the following section. See Chapter 3, "Stage Editors," for a general description of the other tabs.

Mapping Tab
For Column Generator stages the Mapping tab allows you to specify how the output columns are derived, i.e., how the generated data maps onto them.

The left pane shows the generated columns. These are read only and cannot be modified on this tab. These columns are automatically mapped onto the equivalent output columns. The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. The right pane represents the data being output by the stage after the generate operation. In the above example two columns belong to incoming data and have automatically been mapped through and the two generated columns have been mapped straight across.


55
Write Range Map Stage
The Write Range Map stage is a Development/Debug stage. It allows you to write data to a range map. The stage can have a single input link. It can only run in sequential mode. The Write Range Map stage takes an input data set produced by sampling and sorting a data set and writes it to a file in a form usable by the range partitioning method. The range partitioning method uses the sampled and sorted data set to determine partition boundaries. See "Partitioning, Repartitioning, and Collecting Data" on page 2-7 for a description of the range partitioning method. A typical use for the Write Range Map stage would be in a job which used the Sample stage to sample a data set, the Sort stage to sort it and the Write Range Map stage to write the range map which can then be used with the range partitioning method to write the original data set to a file set.

The Write Range Map stage editor has two pages:


Stage Page. This is always present and is used to specify general information about the stage. Inputs Page. This is present when you are writing a range map. This is where you specify details about the file being written to.

Example
In this example, we sample the data in a flat file then pass it to the Write Range Map stage. The stage sorts the data itself before constructing a range map and writing it to a file. Here is the example job:

The stage sorts the data on the same key that it uses to create the range map. The following shows how the on-stage sort is configured, and the properties that determine how the stage will produce the range map:

Must Dos
DataStage has many defaults which means that it can be very easy to include Write Range Map stages in a job. This section specifies the minimum steps to take to get a Write Range Map stage functioning. DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are when you get familiar with the product.


To use a Write Range Map stage:

In the Input Link Properties Tab:
Specify the key column(s) for the range map you are creating.
Specify the name of the range map you are creating.
Specify whether it is OK to overwrite an existing range map of that name (by default an error occurs if a range map with that name already exists).

Ensure that column definitions have been specified for the range map (this can be done in an earlier stage).

Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than the project default to determine collating rules.

Advanced Tab
This tab allows you to specify the following:

Execution Mode. The stage always executes in sequential mode.
Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. This is Set by default. The Partition type is range and cannot be overridden.
Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


NLS Locale Tab


This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Write Range Map stage uses this when it is determining the sort order for key columns. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file.

Inputs Page
The Inputs page allows you to specify details about how the Write Range Map stage writes the range map to a file. The Write Range Map stage can have only one input link. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to view collecting details. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the input link. Details about Write Range Map stage properties and collecting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Input Link Properties Tab


The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written to the range map file. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Category/Property            Values             Default   Mandatory?   Repeats?   Dependent of
Options/File Update Mode     Create/Overwrite   Create    Y            N          N/A
Options/Key                  input column       N/A       Y            Y          N/A
Options/Range Map File       pathname           N/A       Y            N          N/A

Options Category
File Update Mode
This is set to Create by default. If the file you specify already exists this will cause an error. Choose Overwrite to overwrite existing files.

Key
This allows you to specify the key for the range map. Choose an input column from the drop-down list. You can specify a composite key by specifying multiple key properties. You can use the Column Selection dialog box to select several keys at once if required (see page 3-10).

Range Map File
Specify the file that is to hold the range map. You can browse for a file or specify a job parameter.

Partitioning Tab
The Partitioning tab normally allows you to specify details about how the incoming data is partitioned or collected before it is written to the file or files. In the case of the Write Range Map stage execution is always sequential, so there is never a need to set a partitioning method. You can set a collection method if collection is required.

The following Collection methods are available:

(Auto). This is the default collection method for the Write Range Map stage. Normally, when you are using Auto mode, DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operation starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the write range map operation is performed. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the collecting method chosen (it is not available for the default auto methods). Select the check boxes as follows:

Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort. You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu. Because the partition mode is set and cannot be overridden, you cannot use the stage sort facilities, so these are disabled.


56
Parallel Jobs on USS
This chapter explains how parallel jobs can be deployed and run on mainframe systems running z/OS UNIX System Services (popularly known as USS). For information on installing the parallel engine on the USS machine, and setting up remote access to it, see "Installing DataStage Components on a USS System" in DataStage Install and Upgrade Guide. You specify that you want to run jobs on USS systems in the DataStage Administrator client. This is done on a per-project basis. Once you have elected to deploy jobs in a project to USS, you can no longer run parallel jobs from that project on your DataStage server unless you opt to switch back. See "Remote Page" in DataStage Administrator Guide for details on how to set up a project for deployment to USS.
Note You cannot include server shared containers, BASIC Transformer stages, or plugin stages in a job intended for deployment on a USS system. The DB2 stage is the only database stage currently supported.

Set Up
To set up the deployment and running of parallel jobs on a USS system, you need to take the following steps:


Use the DataStage Administrator to specify a project that will be used for parallel jobs intended for USS deployment (see "Remote Page" in DataStage Administrator Guide).
Install the parallel engine on the USS machine and set up access to it as described in "Installing DataStage Components on a USS System" in DataStage Install and Upgrade Guide.
On the server machine, set the environment variable APT_ORCHHOME to identify the parallel engine's top-level directory on the USS system (see the example below).
On the DataStage server machine, construct a suitable configuration file, and set the APT_CONFIG_FILE environment variable to point to it.
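For illustration only (both paths are hypothetical and depend on where the components were installed), the two environment variables might be set to values such as:

APT_ORCHHOME=/usr/lpp/dstage/PXEngine
APT_CONFIG_FILE=/opt/datastage/Configurations/uss.apt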

Deployment Options
There are two options for deploying on USS:

Under control of DataStage. Jobs run under the control of the DataStage Director client. This method suits the scenario where the job developer has direct access to the USS machine.
Deploy standalone. Parallel job scripts are transferred to the USS machine and run there totally independently of DataStage. This method suits the scenario where jobs are run by operators or external schedulers, maybe overnight.

You can have both of these options selected at once, if required, so you do not have to decide how to run a job until you come to run it.

Deploy Under Control of DataStage


With this option selected, you design a job as normal using the DataStage Designer. When you compile the job, DataStage automatically sends it to the machine and the location specified in the DataStage Administrator. When you are ready to run the job, you start the DataStage Director client and select the job and run it as you would any other job. DataStage sends two more files to the USS machine, specifying environment variables and job parameters for the job run. It then uses a remote shell to execute the job on the USS machine. You can specify the remote shell commands and options in the DataStage Administrator. As the job runs, logging information is captured from the remotely executing job and placed in the DataStage log in real time. The log messages indicate that they originate from a remote machine. You can monitor the job from the Director, and collect process meta data for MetaStage.
Note Only size-based monitoring is available when jobs run on the USS system: i.e., you cannot set APT_MONITOR_TIME, only APT_MONITOR_SIZE.

You can run a job on a USS system using the command line or job control interfaces on your DataStage Server as described in the Parallel Job Advanced Developers Guide. You can also include jobs in a job sequence. There are certain restrictions on the use of the built-in DataStage macros when running jobs on USS:
Macro                    Restrictions
DSHostName               Name of DataStage server, not of USS machine
DSProjectName            Supported
DSJobController          Supported
DSJobName                Supported
DSJobStartTimeStamp      Supported, but gives server date and time
DSJobStartDate           Supported, but gives server date
DSJobStartTime           Supported, but gives server time
DSJobWaveNo              Supported
DSJobInvocationId        Supported
DSProjectMapName         Supported (internal value)

When you deploy under the control of DataStage, certain other functions besides running jobs are available:

View Data.
Data set management tool.
Configuration file editing and validation.
Deployment of build stages on USS.
Importing Orchestrate schemas.

Special considerations about these features are described in the following sections.

Using View Data


The View Data button is available on some stage editors, and allows you to view the actual data on a source stage. This facility is available in USS projects if FTP and remote shell options are enabled. DataStage FTPs and remotely executes a script on the USS machine which accesses the data and returns it to the DataStage server.

Using the Data Set Management Tool


The Data Set Management tool is available from the DataStage Designer, Director, and Manager clients. This allows you to view source or, provided the job has already been run, target data sets (see Chapter 57, "Managing Data Sets.") The tool is available from USS projects if FTP and remote shell options are enabled. The header bar of the data set Browse File dialog box indicates that you are browsing data sets on a remote machine.

Editing and Validating Configuration Files


In order to run parallel jobs on a USS machine, it must have a configuration file which describes its parallel capabilities. The default configuration file supplied with the project is NOT suitable for USS deployment (it is designed to run jobs on the DataStage server). The DataStage Manager has a tool which allows you to create, edit and validate configuration files (see Chapter 58, "The Parallel Engine Configuration File.") When you use this tool from within a USS project that is deployed under the control of DataStage, the configuration file is mirrored on the USS machine. It is updated whenever the file on the server is saved. When you use the Check feature, the file is validated against the USS system configuration. The title bar of the Configuration file tool indicates that the file is on the USS machine. We recommend that you change the APT_CONFIG_FILE environment variable in your project to point to the location of your USS configuration file on the server machine (DataStage knows the location on the USS machine and translates as appropriate), although you can set it to point directly to the file on the USS machine itself.

Deploying Build Stages


DataStage allows you to develop your own stages for parallel jobs as described in "Specifying Your Own Parallel Stages" in the Parallel Job Advanced Developers Guide. You can deploy such stages to USS systems for inclusion in parallel jobs. When you generate a Build stage in a USS project, it is automatically sent to the USS machine and built there so that any jobs developed that use the stage will successfully run under USS.


Importing Orchestrate Schemas


DataStage allows you to import table definitions from Orchestrate schemas. This uses the Import Orchestrate Schema wizard, available from the Manager and the Designer (see "Importing a Table Definition" in DataStage Manager Guide). When you are importing definitions into a USS project, the wizard allows you to import from text files, data sets, or file sets on the USS machine.

Deploy Standalone
With this option selected, you design a job as normal using the DataStage Designer. When you compile the job, DataStage produces files which can be transferred to the USS machine using your preferred method. You can then set the correct execute permissions for the files and run the job on the USS machine by executing scripts. If you have specified a remote machine name in the DataStage Administrator Project Properties Remote tab (see "Remote Page" in DataStage Administrator Guide), files will automatically be sent to the USS machine. The job can then be run by executing the scripts on the machine to compile any transformers the job contains and then run the job. You can also enable the send and/or remote shell capabilities in isolation by supplying the required details to the Remote page in the project properties in the DataStage Administrator. Different restrictions apply to the DataStage built-in macros when you run a job using the deploy standalone method:
Macro                    Restrictions
DSHostName               Name of DataStage server, not of USS machine
DSProjectName            Supported
DSJobController          Not supported
DSJobName                Supported
DSJobStartTimeStamp      Not supported
DSJobStartDate           Not supported
DSJobStartTime           Not supported
DSJobWaveNo              Not supported
DSJobInvocationId        Not supported
DSProjectMapName         Supported (internal value)


Details of the files that are created and where they should be transferred to on the USS machine are given in the following section, "Implementation Details".

Implementation Details
This section describes the directory structure required for job deployment on USS machines. It also describes files that are generated by DataStage when you compile a job in a USS project.

Directory Structure
Each job deployed to the USS machine must have a dedicated directory. If you are allowing DataStage to automatically send files to the USS machine for you at compile time, the files are, by default, copied to the following directory:
/Base_directory/project_name/RT_SCjobnumber

Base_directory. You must specify a specific base directory in the DataStage Administrator (see "Remote Page" in DataStage Administrator Guide).
project_name. This is a directory named after the USS project.
RT_SCjobnumber. This is the directory that holds all the deployment files for a particular job. By default the job directory is RT_SCjobnumber, where jobnumber is the internal job number allocated by DataStage, but you can change the form of this name in the DataStage Administrator (see "Remote Page" in DataStage Administrator Guide).

If you are deploying standalone, and are not automatically sending files, you can specify your own directory structure, but we recommend that you follow the /base_directory/project_name/job_identifier model. On the DataStage server the files are copied to the directory:
$DSHOME/../Projects/project_name/RT_SCjobnumber
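For example (all names are hypothetical), a job whose internal number is 23, in a USS project called ussproj deployed under a base directory of /u/dsdeploy, would use /u/dsdeploy/ussproj/RT_SC23 on the USS machine and $DSHOME/../Projects/ussproj/RT_SC23 on the DataStage server.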


Generated Files
When you compile a parallel job intended for deployment on a USS machine, it produces various files which are copied into the job directory on the USS machine. The files are as follows:
OshScript.osh: The main parallel job script. This script is run automatically via a remote shell when jobs are run under the control of DataStage. The script needs to be run manually using the pxrun.sh script when jobs are deployed standalone.
pxrun.sh: This script is run in order to run OshScript.osh when jobs are deployed standalone.
jpdepfile: This is used by pxrun.sh. It contains the job parameters for a job deployed standalone when it is run. It is based on the default job parameters when the job was compiled.
evdepfile: This is sourced by pxrun.sh. It contains the environment variables for a job deployed standalone when it is run. It is based on the environment variables set when the job was compiled.
pxcompile.sh: This file is generated if the job contains one or more Transformer stages and the Deploy Standalone option is selected. It is used to control the compilation of the transformers on the USS machine.
internalidentifier_jobname_stagename.trx: There is a file for each Transformer stage in the job; it contains the source code for each stage.
internalidentifier_jobname_stagename.trx.sh: This is a script for compiling Transformer stages. There is one for each Transformer stage. It is called by pxcompile.sh; it can be called individually if required.
internalidentifier_jobname_stagename.trx.osh: Parallel job script to compile the corresponding Transformer stage. Called from the corresponding .sh file.

Where you are deploying jobs under the control of DataStage, you will also see the following files in the job directory on the USS machine: OshExecute.sh. This executes the job script, OshScript.osh, under the control of DataStage. You should NOT attempt to run this file manually.

jpfile and evfile. These are visible while the job is actually running and contain the job parameters and environment variables used for the job run. If your job contains one or more Transformer stages, you will also see the following files in your job directory: jobnamestagename.trx.so. Object file, one for each Transformer stage. jobnamestagename.trx.C. If the compilation of the corresponding Transformer stage fails for some reason, this file is left behind.

Configuration Files
In order to run parallel jobs on the USS machine, there must be a configuration file describing the parallel capabilities of that machine (see Chapter 58, "The Parallel Engine Configuration File.") If you deploy jobs under the control of DataStage the configuration maintained on the server will be automatically mirrored on the USS machine when you edit it. If you deploy jobs standalone, you must ensure that the USS system has a valid configuration file identified by the environment variable APT_CONFIG_FILE. For more information on configuration files and USS systems, see "Installing DataStage Components on a USS System" in DataStage Install and Upgrade Guide.

Running Jobs on the USS Machine


This section describes three basic scenarios for running DataStage parallel jobs on a USS machine: Deploying under the control of DataStage and running from the DataStage Director. Deploying under the control of DataStage but running manually. Deploying and running manually.

Deploying and Running from DataStage


In order to deploy the job from DataStage and run it from the DataStage Director, proceed as follows:
1  In the DataStage Administrator, in the Project Properties dialog box, set up the project on the Remote page as follows:

Select the Jobs run under control of DataStage option.


Specify the name of the target machine and the username and password used to connect to it. This is used to send the job deployment files to the USS machine.
Specify a template for the remote shell used to run the jobs (in most cases you can use the default, so you need take no action here).
Specify a base directory on the USS machine to hold the project directory and all the individual job directories.
Optionally specify a template for naming the job directories on the USS machine.
Optionally specify commands to be executed on the DataStage server after the job files have been deployed.

For more details about making these settings in the DataStage Administrator, see "Remote Page" in DataStage Administrator Guide.
2  In the DataStage Designer, design your parallel job as normal (but remember that you cannot use BASIC Transformer stages, shared containers, or plugin stages in jobs to run under USS). When you are happy with your job design, compile it. As part of this process, the necessary files will be sent to the specified location on the USS machine, and the remote shell invoked to set permissions and perform other housekeeping tasks. The environment variables as set at job compile time, and any default settings for job parameters are transferred as part of this process.
3  In the DataStage Director, select the job and run it. Set the required parameters and set any environment variables required for this run in the Job Run Options dialog box. DataStage will use the remote shell to run the job on the USS machine (if required you could alternatively run the job from the command line of the server machine, or using the job control facilities described in "DataStage Development Kit (Job Control Interfaces)" in the Parallel Job Advanced Developers Guide).

Deploying from DataStage, Running Manually


This section describes a halfway-house solution, whereby you can use DataStage to automatically copy the required files to the USS machine, and set the correct permissions, but run the jobs manually directly from the USS machine.


1  In the DataStage Administrator, in the Project Properties dialog box Remote page, set up the project as follows:

a  Select the Jobs run under control of DataStage option and the Deploy Standalone parallel job scripts option.
b  Specify the name of the target machine and the username and password used to connect to it. This is used to FTP the job deployment files to the USS machine.
c  Specify a template for the remote shell used to run the jobs.
d  Optionally specify a base directory on the USS machine to hold the project directory and all the individual job directories.
e  Optionally specify a template for naming the job directories on the USS machine.
f  Optionally specify commands to be executed on the DataStage server after the job files have been deployed.

For more details about making these settings in the DataStage Administrator, see "Remote Page" in DataStage Administrator Guide.

2  In the DataStage Administrator, set the environment variable APT_CONFIG_FILE to identify the configuration file used to run jobs on the USS system.
3  In the DataStage Designer, design your parallel job as normal (but remember that you cannot use BASIC Transformer stages, server shared containers, or plugin stages in jobs to run under USS). When you are happy with your job design, compile it. As part of this process, the necessary files will be FTPed to the specified location on the USS machine, and the remote shell invoked to set permissions and perform other housekeeping tasks. The environment variables as set at job compile time, and any default settings for job parameters are transferred as part of this process.
4  When you are ready to run your job, on the USS machine, go to the job directory for the required job.
5  If your job contains Transformer stages, execute the following file:
pxcompile.sh
6  When your Transformer stages have successfully compiled, run the job by executing the following file:
pxrun.sh


Deploying and Running Manually


This describes how to set DataStage up so that you both transfer jobs to the USS machine and run them manually, without further intervention by DataStage.
1  In the DataStage Administrator, in the Project Properties dialog box Remote page, set up the project as follows: select the Deploy Standalone parallel job scripts option.
2  In the DataStage Administrator, set the environment variable APT_CONFIG_FILE to identify the configuration file used to run jobs on the USS system.
3  In the DataStage Designer, design your parallel job as normal (but remember that you cannot use BASIC Transformer stages, shared containers, or plugin stages in jobs to run under USS). When you are happy with your job design, compile it.
4  On your DataStage server, go to the directory:
$DSHOME/../Projects/project_name/RT_SCjobnumber
5  Copy the following files to a directory on your USS machine (each job must be in a separate directory):
OshScript.osh
pxrun.sh
jpdepfile
evdepfile
If your job contains Transformer stages, you will also need to copy the following files:
pxcompile.sh
jobnamestagename.trx
jobnamestagename.trx.sh
jobnamestagename.trx.osh

Note You can enter commands in the Custom deployment commands field in the DataStage Administrator Project Properties dialog box Remote page to further automate the process of deployment, for example:
tar -cvf %j.tar *
cp %j.tar /home/mydeploy

When you are ready to run your job, on the USS machine, go to the job directory for the required job.

If your job contains Transformer stages, execute the following file:
pxcompile.sh
When your Transformer stages have successfully compiled, run the job by executing the following file:
pxrun.sh
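A session on the USS machine might therefore look like the following sketch (the job directory name is hypothetical, and the scripts are assumed to have execute permission):

cd /u/dsdeploy/ussproj/RT_SC23
./pxcompile.sh
./pxrun.sh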


57
Managing Data Sets
DataStage parallel extender jobs use data sets to store data being operated on in a persistent form. Data sets are operating system files, each referred to by a descriptor file, usually with the suffix .ds. You can create and read data sets using the Data Set stage, which is described in Chapter 6. DataStage also provides a utility for managing data sets from outside a job. This utility is available from the DataStage Designer, Manager, and Director clients.

Structure of Data Sets


A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments. Each partition of a data set is stored on a single processing node. Each data segment contains all the records written by a single DataStage job. So a segment can contain files from many partitions, and a partition has files from many segments.


[Diagram: a data set with four partitions (Partition 1 to Partition 4), each divided into segments (Segment 1 to Segment 3); each partition and segment combination is stored as one or more data files.]

The descriptor file for a data set contains the following information:

Data set header information.
Creation time and date of the data set.
The schema of the data set.
A copy of the configuration file used when the data set was created.

For each segment, the descriptor file contains:

The time and date the segment was added to the data set.
A flag marking the segment as valid or invalid.
Statistical information such as number of records in the segment and number of bytes.
Path names of all data files, on all processing nodes.

This information can be accessed through the Data Set Manager.

Starting the Data Set Manager


To start the Data Set Manager from the DataStage Designer, Manager, or Director:


1  Choose Tools > Data Set Management. A Browse Files dialog box appears:

2  Navigate to the directory containing the data set you want to manage. By convention, data set files have the suffix .ds.
3  Select the data set you want to manage and click OK. The Data Set Viewer appears. From here you can copy or delete the chosen data set. You can also view its schema (column definitions) or the data it contains.


Data Set Viewer


The Data Set viewer displays information about the data set you are viewing:

Partitions
The partition grid shows the partitions the data set contains and describes their properties:
#. The partition number.
Node. The processing node that the partition is currently assigned to.
Records. The number of records the partition contains.
Blocks. The number of blocks the partition contains.
Bytes. The number of bytes the partition contains.

Segments
Click on an individual partition to display the associated segment details. This contains the following information:
#. The segment number.
Created. Date and time of creation.
Bytes. The number of bytes in the segment.
Pathname. The name and path of the file containing the segment in the selected partition.

Click the Refresh button to reread and refresh all the displayed information. Click the Output button to view a text version of the information displayed in the Data Set Viewer. You can open a different data set from the viewer by clicking the Open icon on the tool bar. The browse dialog box opens again and lets you browse for a data set.

Viewing the Schema


Click the Schema icon from the tool bar to view the record schema of the current data set. This is presented in text form in the Record Schema window.
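For illustration only, the window presents the schema in the record schema notation used elsewhere in this guide; a minimal, hypothetical example (the field names and types below are invented, not taken from any real data set) might look like this:

record
(
  custID: int32;
  orderDate: date;
  amount: decimal[8,2];
  comment: nullable string[max=30];
)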

Viewing the Data


Click the Data icon from the tool bar to view the data held by the current data set. This opens the Data Viewer Options dialog box, which allows you to select a subset of the data to view.

- Rows to display. Specify the number of rows of data you want the data browser to display.
- Skip count. Skip the specified number of rows before viewing data.
- Period. Display every Pth record, where P is the period. You can start after records have been skipped by using the Skip count property. P must be greater than or equal to 1.
- Partitions. Choose between viewing the data in All partitions or the data in the partition selected from the drop-down list.

Click OK to view the selected data; the Data Viewer window appears.

Copying Data Sets


Click the Copy icon on the tool bar to copy the selected data set. The Copy data set dialog box appears, allowing you to specify a path where the new data set will be stored:

The new data set will have the same record schema, number of partitions and contents as the original data set.
Note You cannot use the UNIX cp command to copy a data set because DataStage represents a single data set with multiple files.

Deleting Data Sets


Click the Delete icon on the tool bar to delete the current data set. You will be asked to confirm the deletion.

Note You cannot use the UNIX rm command to delete a data set because DataStage represents a single data set with multiple files. Using rm simply removes the descriptor file, leaving the much larger data files behind.
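Parallel engine installations typically also include a command-line data set utility, orchadmin, which treats the descriptor and its data files as a unit. The exact set of subcommands varies by release, so treat the following as an assumed sketch rather than a definitive reference and check the documentation for your installation:

orchadmin describe mydata.ds        # show partition and segment details (assumed subcommand)
orchadmin copy mydata.ds copy.ds    # copy the descriptor and its data files together (assumed subcommand)
orchadmin rm mydata.ds              # remove the descriptor and its data files together (assumed subcommand)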


58
The Parallel Engine Configuration File
One of the great strengths of DataStage Enterprise Edition is that, when designing parallel jobs, you don't have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities. If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don't necessarily have to change your job design. DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs. This chapter describes how to define configuration files that specify what processing, storage, and sorting facilities on your system should be used to run a parallel job. You can maintain multiple configuration files and read them into the system according to your varying processing needs. When you install DataStage Enterprise Edition, the system is automatically configured to use the supplied default configuration file. This allows you to run parallel jobs right away, but is not optimized for your system. Follow the instructions in this chapter to produce a configuration file specifically for your system.

Configurations Editor
The DataStage Manager provides a configuration file editor to help you define configuration files for the parallel engine. To use the editor, choose Tools > Configurations; the Configurations dialog box appears.

To define a new file, choose (New) from the Configurations drop-down list and type into the upper text box. Guidance on the operation and format of a configuration file is given in the following sections. Click Save to save the file at any point. You are asked to specify a configuration name; the file is then saved under that name with an .apt extension.

You can verify your file at any time by clicking Check. Verification information is output in the Check Configuration Output pane at the bottom of the dialog box.

To edit an existing configuration file, choose it from the Configurations drop-down list. You can delete an existing configuration by selecting it and clicking Delete. You are warned if you are attempting to delete the last remaining configuration file. You specify which configuration will be used by setting the APT_CONFIG_FILE environment variable. This is set on installation to point to the default configuration file, but you can set it on a project-wide level from the DataStage Administrator (see Setting Environment Variables in the DataStage Administrator Guide) or for individual jobs from the Job Properties dialog.
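For example, to point a particular command-line session at a specific configuration file before running jobs, you could export the variable directly (the path shown here is hypothetical):

export APT_CONFIG_FILE=/u1/dsadm/configs/four_node.apt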

Configuration Considerations
The parallel engine's view of your system is determined by the contents of your current configuration file. Your file defines the processing nodes and disk space connected to each node that you allocate for use by parallel jobs. When invoking a parallel job, the parallel engine first reads your configuration file to determine what system resources are allocated to it and then distributes the job to those resources.

When you modify the system by adding or removing nodes or disks, you must modify your configuration file correspondingly. Since the parallel engine reads the configuration file every time it runs a parallel job, it automatically scales the application to fit the system without your having to alter the job code. Your ability to modify the parallel engine configuration means that you can control the parallelization of a parallel job during its development cycle. For example, you can first run the job on one node, then on two, then on four, and so on. You can measure system performance and scalability without altering application code.

Logical Processing Nodes


A parallel engine configuration file defines one or more processing nodes on which your parallel job will run. The processing nodes are logical rather than physical. The number of processing nodes does not necessarily correspond to the number of CPUs in your system. Your configuration file can define one processing node for each physical node in your system, or multiple processing nodes for each physical node.

Optimizing Parallelism
The degree of parallelism of a parallel job is determined by the number of nodes you define when you configure the parallel engine. Parallelism should be optimized for your hardware rather than simply maximized. Increasing parallelism distributes your work load but it also adds to your overhead because the number of processes increases. Increased parallelism can actually hurt performance once you exceed the capacity of your hardware. Therefore you must weigh the gains of added parallelism against the potential losses in processing efficiency. Obviously, the hardware that makes up your system influences the degree of parallelism you can establish. SMP systems allow you to scale up the number of CPUs and to run your parallel application against more memory. In general, an SMP system can support multiple logical nodes. Some SMP systems allow scalability of disk I/O. Configuration Options for an SMP on page -6 discusses these considerations. In a cluster or MPP environment, you can use the multiple CPUs and their associated memory and disk resources in concert to tackle a single computing problem. In general, you have one logical node per CPU on an MPP system. Configuration Options for an MPP System on page -9 describes these issues.

The properties of your system's hardware also determine configuration. For example, applications with large memory requirements, such as sort operations, are best assigned to machines with a lot of memory. Applications that will access an RDBMS must run on its server nodes; and stages using other proprietary software, such as SAS, must run on nodes with licenses for that software.

Here are some additional factors that affect the optimal degree of parallelism:
- CPU-intensive applications, which typically perform multiple CPU-demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.
- Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.
- Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMSs, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes. For some jobs, especially those that are disk-intensive, you must sometimes configure your system to prevent the RDBMS from having either to redistribute load data or to re-partition the data from an extract operation.
- The speed of communication among stages should be optimized by your configuration. For example, jobs whose stages exchange large amounts of data should be assigned to nodes where stages communicate by either shared memory (in an SMP environment) or a high-speed link (in an MPP environment). The relative placement of jobs whose stages share small amounts of data is less important.
- For SMPs, you may want to leave some processors for the operating system, especially if your application has many stages in a job. See Configuration Options for an SMP, later in this chapter.
- In an MPP environment, parallelization can be expected to improve the performance of CPU-limited, memory-limited, or disk I/O-limited applications. See Configuration Options for an MPP System, later in this chapter.
- The most nearly-equal partitioning of data contributes to the best overall performance of a job run in parallel. For example, when hash partitioning, try to ensure that the resulting partitions are evenly populated. This is referred to as minimizing skew.
- Experience is the best teacher. Start with smaller data sets and try different parallelizations while scaling up the data set sizes to collect performance statistics.

Configuration Options for an SMP


An SMP contains multiple CPUs which share operating system, disk, and I/O resources. Data is transported by means of shared memory. A number of factors contribute to the I/O scalability of your SMP. These include the number of disk spindles, the presence or absence of RAID, the number of I/O controllers, and the speed of the bus connecting the I/O system to memory. SMP systems allow you to scale up the number of CPUs. Increasing the number of processors you use may or may not improve job performance, however, depending on whether your application is CPU-, memory-, or I/O-limited. If, for example, a job is CPU-limited, that is, the memory, memory bus, and disk I/O of your hardware spend a disproportionate amount of time waiting for the CPU to finish its work, it will benefit from being executed in parallel. Running your job on more processing units will shorten the waiting time of other resources and thereby speed up the overall application. All SMP systems allow you to increase your parallel job's memory access bandwidth. However, none allow you to increase the memory bus capacity beyond that of the hardware configuration. Therefore, memory-intensive jobs will also benefit from increased parallelism, provided they do not saturate the memory bus capacity of your system. If your application is already approaching, or at, the memory bus limit, increased parallelism will not provide performance improvement. Some SMP systems allow scalability of disk I/O. In those systems, increasing parallelism can increase the overall throughput rate of jobs that are disk I/O-limited.

For example, the following figure shows a data flow containing three parallel stages:
[Figure: a data flow in which records pass through three parallel stages in sequence: stage 1, stage 2, stage 3.]

For each stage in this data flow, the parallel engine creates a single UNIX process on each logical processing node (provided that stage combining is not in effect). On an SMP defined as a single logical node, each stage runs sequentially as a single process, and the parallel engine executes three processes in total for this job. If the SMP has three or more CPUs, the three processes in the job can be executed simultaneously by different CPUs. If the SMP has fewer than three CPUs, the processes must be scheduled by the operating system for execution, and some or all of the processors must execute multiple processes, preventing true simultaneous execution. In order for an SMP to run parallel jobs, you configure the parallel engine to recognize the SMP as a single or as multiple logical processing node(s), that is:
1 <= M <= N logical processing nodes, where N is the number of CPUs on the SMP and M is the number of processing nodes on the configuration. (Although M can be greater than N when there are more disk spindles than there are CPUs.)

As a rule of thumb, it is recommended that you create one processing node for every two CPUs in an SMP. You can modify this configuration to determine the optimal configuration for your system and application during application testing and evaluation. In fact, in most cases the scheduling performed by the operating system allows for significantly more than one process per processor to be managed before performance degradation is seen. The exact number depends on the nature of the processes, bus bandwidth, caching effects, and other factors.
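For instance, applying this rule of thumb to a four-CPU SMP gives a configuration with two logical processing nodes. A minimal sketch (the host name and directory paths below are hypothetical) might look like this:

{
  node "node0" {
    fastname "smp_host"
    pools ""
    resource disk "/data/ds/disk0" {}
    resource scratchdisk "/data/ds/scratch0" {}
  }
  node "node1" {
    fastname "smp_host"
    pools ""
    resource disk "/data/ds/disk1" {}
    resource scratchdisk "/data/ds/scratch1" {}
  }
}

Both logical nodes specify the same fastname because they are on the same physical machine.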

Depending on the type of processing performed in your jobs (sorting, statistical analysis, database I/O), different configurations may be preferable. For example, on an SMP viewed as a single logical processing node, the parallel engine creates a single UNIX process on the processing node for each stage in a data flow. The operating system on the SMP schedules the processes to assign each process to an individual CPU. If the number of processes is less than the number of CPUs, some CPUs may be idle. For jobs containing many stages, the number of processes may exceed the number of CPUs. If so, the processes will be scheduled by the operating system. Suppose you want to configure the parallel engine to recognize an eight-CPU SMP, for example, as two or more processing nodes. When you configure the SMP as two separate processing nodes, the parallel engine creates two processes per stage on the SMP. For the three-stage job shown above, configuring an SMP as more than two parallel engine processing nodes creates at least nine UNIX processes, although only eight CPUs are available. Process execution must be scheduled by the operating system. For that reason, configuring the SMP as three or more parallel engine processing nodes can conceivably degrade performance as compared with that of a one- or two-processing node configuration. This is so because each CPU in the SMP shares memory, I/O, and network resources with the others. However, this is not necessarily true if some stages read from and write to disk or the network; in that case, other processes can use the CPU while the I/O-bound processes are blocked waiting for operations to finish.

Example Configuration File for an SMP


This section contains a sample configuration file for the four-CPU SMP shown below:

[Figure: a single SMP containing four CPUs.]

The table below lists the processing node names and the file systems used by each processing node for both permanent and temporary storage in this example system:
Node name   Node name on fast network   Node pools               Directory for permanent storage   Directory for temp storage
node0       node0_byn                   "", node0, node0_fddi    /orch/s0, /orch/s1                /scratch
node1       node0_byn                   "", node1, node1_fddi    /orch/s0, /orch/s1                /scratch
The table above also contains a column for node pool definitions. Node pools allow you to execute a parallel job or selected stages on only the nodes in the pool. See Node Pools and the Default Node Pool on page -22 for more details. In this example, the parallel engine processing nodes share two file systems for permanent storage. The nodes also share a local file system (/scratch) for temporary storage. Here is the configuration file corresponding to this system. Configuration Files on page -15 discusses the keywords and syntax of configuration files.
{ node "node0" { fastname "node0_byn" /* node name on a fast network */ pools "" "node0" "node0_fddi" /* node pools */ resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node1" { fastname "node0_byn" pools "" "node1" "node1_fddi" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } }

Configuration Options for an MPP System


An MPP consists of multiple hosts, where each host runs its own image of the operating system and contains its own processors, disk, I/O resources, and memory. This is also called a shared-nothing environment. Each host in the system is connected to all others by a high-speed network. A host is also referred to as a physical node. In an MPP environment, you can use the multiple CPUs and their associated memory and disk resources in concert. In this
environment, each CPU has its own dedicated memory, memory bus, disk, and disk access. When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs. You do not have to specify all nodes.

An Example of a Four-Node MPP System Configuration


The following figure shows a sample MPP system containing four physical nodes:

[Figure: four physical nodes, node0 through node3, each with one CPU. Each node has a connection name on the high-speed network (switch), node0_css through node3_css, and is also connected to an Ethernet.]

This figure shows a disk-everywhere configuration. Each node is connected to both a high-speed switch and an Ethernet. Note that the configuration information below for this MPP would be similar for a cluster of four SMPs connected by a network. The following table shows the storage local to each node:

Node name   Node name on fast network   Node pools               Directory for permanent storage   Directory for temp storage
node0       node0_css                   "", node0, node0_css     /orch/s0, /orch/s1                /scratch
node1       node1_css                   "", node1, node1_css     /orch/s0, /orch/s1                /scratch
node2       node2_css                   "", node2, node2_css     /orch/s0, /orch/s1                /scratch
node3       node3_css                   "", node3, node3_css     /orch/s0, /orch/s1                /scratch

Note that because this is an MPP system, each node in this configuration has its own disks and hence its own /orch/s0, /orch/s1, and /scratch. If this were an SMP, the logical nodes would be sharing the same directories. Here is the configuration file for this sample system. Configuration Files, later in this chapter, discusses the keywords and syntax of configuration files.
{ node "node0" { fastname "node0_css" /* node name on a fast network*/ pools "" "node0" "node0_css" /* node pools */ resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node1" { fastname "node1_css" pools "" "node1" "node1_css" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node2" { fastname "node2_css" pools "" "node2" "node2_css" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node3" { fastname "node3_css" pools "" "node3" "node3_css" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } }

Configuration Options for an SMP Cluster


An SMP cluster consists of one or more SMPs and, optionally, single-CPU nodes connected by a high-speed network. In this case, each SMP in the cluster is referred to as a physical node. When you configure your system, you can divide a physical node into logical nodes. The following figure shows a cluster containing four physical nodes, one of which (node1) is an SMP containing two CPUs.

[Figure: a cluster of four physical nodes, node0 through node3, connected by a high-speed network (switch) and by an Ethernet. node1 is an SMP with two CPUs; the other nodes each have one CPU. The switch connection names are node0_css through node3_css.]

An Example of an SMP Cluster Configuration


The following configuration file divides physical node1 into logical nodes node1 and node1a. Both are connected to the high-speed switch by the same fastname; in the configuration file, the same fastname is specified for both nodes. Configuration Files on page -15 discusses the keywords and syntax of Orchestrate configuration files.
{ node "node0" { fastname "node0_css"/* node name on a fast network */ pools "" "node0" "node0_css" /* node pools */ resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node1" { fastname "node1_css" pools "" "node1" "node1_css" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node1a"{ fastname "node1_css" pools "" "node1" "node1_css" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } node "node2" { fastname "node2_css" pools "" "node2" "node2_css" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} }

58-12

Parallel Job Developers Guide

The Parallel Engine Configuration File

Configuration Considerations

node "node3" { fastname "node3_css" pools "" "node3" "node3_css" resource disk "/orch/s0" {} resource disk "/orch/s1" {} resource scratchdisk "/scratch" {} } }

In this example, consider the disk definitions for /orch/s0. Since node1 and node1a are logical nodes of the same physical node, they share access to that disk. Each of the remaining nodes, node0, node2, and node3, has its own /orch/s0 that is not shared. That is, there are four distinct disks called /orch/s0. Similarly, /orch/s1 and /scratch are shared between node1 and node1a but not the others.

Options for a Cluster with the Conductor Unconnected to the High-Speed Switch
The parallel engine starts your parallel job from the Conductor node. A cluster may have a node that is not connected to the others by a high-speed switch, as in the following figure:

[Figure: the same cluster of four physical nodes (node0 through node3, with switch connection names node0_css through node3_css) on the high-speed network (switch), plus an additional single-CPU node, node4, that is connected to the other nodes only by the Ethernet.]

In this example, node4 is the Conductor, which is the node from which you need to start your application. By default, the parallel engine communicates between nodes using the fastname, which in this example refers to the high-speed switch. But because the Conductor is not on that switch, it cannot use the fastname to reach the other nodes.

Therefore, to enable the Conductor node to communicate with the other nodes, you need to identify each node on the high-speed switch by its canonicalhostname and give its Ethernet name as its quoted attribute, as in the following configuration file. Configuration Files on page -15 discusses the keywords and syntax of Orchestrate configuration files.
{
  node "node0" {
    fastname "node0_css"
    resource canonicalhostname "node1-eth-1"
    pools "" "node0" "node0_css"
    resource disk "/orch/s0" {}
    resource disk "/orch/s1" {}
    resource scratchdisk "/scratch" {}
  }
  node "node1" {
    fastname "node1_css"
    resource canonicalhostname "node1-eth-1"
    pools "" "node1" "node1_css"
    resource disk "/orch/s0" {}
    resource disk "/orch/s1" {}
    resource scratchdisk "/scratch" {}
  }
  node "node2" {
    fastname "node2_css"
    resource canonicalhostname "node1-eth-1"
    pools "" "node2" "node2_css"
    resource disk "/orch/s0" {}
    resource disk "/orch/s1" {}
    resource scratchdisk "/scratch" {}
  }
  node "node3" {
    fastname "node3_css"
    resource canonicalhostname "node1-eth-1"
    pools "" "node3" "node3_css"
    resource disk "/orch/s0" {}
    resource disk "/orch/s1" {}
    resource scratchdisk "/scratch" {}
  }
  node "node4" {
    pools "conductor" "node4" "node4_css"  /* not in the default pool */
    resource disk "/orch/s0" {}
    resource disk "/orch/s1" {}
    resource scratchdisk "/scratch" {}
  }
}

Note Since node4 is not on the high-speed switch and we are therefore using it only as the Conductor node, we have left it out of the default node pool (""). This causes the parallel engine to avoid placing stages on node4. See Node Pools and the Default Node Pool on page -22.

Diagram of a Cluster Environment


The following figure shows a mixed MPP and SMP cluster environment containing six physical nodes. Only the four nodes on the left are intended to be allocated for use by the parallel engine.

[Figure: six physical nodes connected by a high-speed network (switch). The four single-CPU nodes on the left are labeled for parallel engine processing; the two SMP nodes on the right (one with two CPUs, one with four CPUs) are not allocated to the parallel engine.]

Configuration Files
This section describes parallel engine configuration files, and their uses and syntax. The parallel engine reads a configuration file to ascertain what processing and storage resources belong to your system. Processing resources include nodes; and storage resources include both disks for the permanent storage of data and disks for the temporary storage of data (scratch disks). The parallel engine uses this information to determine, among other things, how to arrange resources for parallel execution. You must define each processing node on which the parallel engine runs jobs and qualify its characteristics; you must do the same for each disk that will store data. You can specify additional information about nodes and disks on which facilities such as sorting or SAS operations will be run, and about the nodes on which to run stages that access the following relational data base management systems: DB2, INFORMIX, and Oracle. You can maintain multiple configuration files and read them into the system according to your needs. Orchestrate provides a sample configuration file, install_dir/etc/config.apt, where install_dir is the top-level directory of your parallel engine installation. This section contains the following subsections:

- The Default Path Name and the APT_CONFIG_FILE
- Syntax
- Node Names
- Options
- Node Pools and the Default Node Pool
- Disk and Scratch Disk Pools and Their Defaults
- Buffer Scratch Disk Pools

The Default Path Name and the APT_CONFIG_FILE


The default name of the configuration file is config.apt. When you run a parallel job, the parallel engine searches for the file config.apt as follows:
- In the current working directory.
- If it is not there, in install_dir/etc, where install_dir is the top-level directory of your parallel engine installation ($APT_ORCHHOME).

You can give the configuration file a different name or location or both from their defaults. If you do, assign the new path and file name to the environment variable APT_CONFIG_FILE. If APT_CONFIG_FILE is defined, the parallel engine uses that configuration file rather than searching in the default locations. In a production environment, you can define multiple configurations and set APT_CONFIG_FILE to different path names depending on which configuration you want to use. You can set APT_CONFIG_FILE on a project-wide level from the DataStage Administrator (see Setting Environment Variables in the DataStage Administrator Guide) or for individual jobs from the Job Properties dialog.
Note Although the parallel engine may have been copied to all processing nodes, you need to copy the configuration file only to the nodes from which you start parallel engine applications (conductor nodes).

Syntax
Configuration files are text files containing string data that is passed to Orchestrate. The general form of a configuration file is as follows:
/* commentary */
{
  node "node name" {
    <node information>
    .
    .
    .
  }
  .
  .
  .
}

These are the syntactic characteristics of configuration files:
- Braces { } begin and end the file.
- The word node begins every node definition.
- The word node is followed by the name of the node enclosed in quotation marks. For a detailed discussion of node names, see Node Names below.
- Braces { } follow the node name. They enclose the information about the node (its options), including an enumeration of each disk and scratch disk resource. The legal options are: fastname, pools, and resource.
- Spaces separate items.
- Quotation (") marks surround the attributes you assign to options, that is, the names of nodes, disks, scratch disks, and pools.
- Comments are demarcated by /* . . . */, in the style of the C programming language. They are optional, but are recommended where indicated in the examples.

Node Names
Each node you define is followed by its name enclosed in quotation marks, for example:
node "orch0"

For a single-CPU node or workstation, the node's name is typically the network name of a processing node on a connection such as a high-speed switch or Ethernet. Issue the following UNIX command to learn a node's network name:
$ uname -n

On an SMP, if you are defining multiple logical nodes corresponding to the same physical node, you replace the network name with a logical node name. In this case, you need a fast name for each logical node. If you run an application from a node that is undefined in the corresponding configuration file, each user must set the environment variable APT_PM_CONDUCTOR_NODENAME to the fast name of the node invoking the parallel job.
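For example, a user who starts jobs from such an undefined node might add a line like the following to a login script, substituting the invoking node's name on the fast network (the value shown simply reuses the network name reported by uname -n; on a system with a separate high-speed switch you would use the switch name instead):

export APT_PM_CONDUCTOR_NODENAME=`uname -n`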

Options
Each node takes options that define the groups to which it belongs and the storage resources it employs. Options are as follows:
fastname

Syntax:

fastname "name"

This option takes as its quoted attribute the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET. The fastname is the physical node name that stages use to open connections for high volume data transfers. The attribute of this option is often the network name. For an SMP, all CPUs share a single connection to the network, and this setting is the same for all parallel engine processing nodes defined for an SMP. Typically, this is the principal node name, as returned by the UNIX command uname -n.
pools

Syntax: pools "node_pool_name0" "node_pool_name1" ...

The pools option indicates the names of the pools to which this node is assigned. The option's attribute is the pool name or a space-separated list of names, each enclosed in quotation marks. For a detailed discussion of node pools, see Node Pools and the Default Node Pool later in this chapter. Note that the resource disk and resource scratchdisk options can also take pools as an option, where it indicates disk or scratch disk pools. For a detailed discussion of disk and scratch disk pools, see Disk and Scratch Disk Pools and Their Defaults later in this chapter. Node pool names can be dedicated. Reserved node pool names include the following names:
DB2          See the DB2 resource below and The resource DB2 Option later in this chapter.
INFORMIX     See the INFORMIX resource below and The resource INFORMIX Option later in this chapter.
ORACLE       See the ORACLE resource below and The resource ORACLE option later in this chapter.
sas          See The SAS Resources later in this chapter.
sort         See Sort Configuration later in this chapter.

Reserved disk pool names include the following names:

buffer       See Buffer Scratch Disk Pools later in this chapter.
export       For use by the export stage.
lookup       For use by the lookup stage.
sasdataset   See The SAS Resources later in this chapter.
sort         See Sort Configuration later in this chapter.

resource

Syntax:

resource resource_type "location" [{pools "disk_pool_name"}]

|
resource resource_type "value"

The resource_type can be one of the following:


canonicalhostname

Syntax:

canonicalhostname "ethernet name"

The canonicalhostname resource takes as its quoted attribute the ethernet name of a node in a cluster that is unconnected to the Conductor node by the high-speed network. If the Conductor node cannot reach the unconnected node by a fastname, you must define the unconnected node's canonicalhostname to enable communication.
DB2

Syntax:

resource DB2 "node_number" [{pools "instance_owner" ...}]

This option allows you to specify logical names as the names of DB2 nodes. For a detailed discussion of configuring DB2, see The resource DB2 Option on page -25.

disk

Syntax:

resource disk "directory_path" [{pools "poolname"...}]

Assign to this option the quoted absolute path name of a directory belonging to a file system connected to the node. The node reads persistent data from and writes persistent data to this directory. One node can have multiple disks. Relative path names are not supported. Typically, the quoted name is the root directory of a file system, but it does not have to be. For example, the quoted name you assign to disk can be a subdirectory of the file system. You can group disks in pools. Indicate the pools to which a disk belongs by following its quoted name with a pools definition enclosed in braces. For a detailed discussion of disk pools, see Disk and Scratch Disk Pools and Their Defaults on page -23.
INFORMIX

Syntax:

resource INFORMIX "coserver_basename" [{pools "db_server_name" ... }]

This option allows you to specify logical names as the names of INFORMIX nodes. For a detailed discussion of configuring INFORMIX, see The resource INFORMIX Option on page -26.
ORACLE

Syntax:

resource ORACLE "nodename" [{pools "db_server_name" ...}]

This option allows you to define the nodes on which Oracle runs. For a detailed discussion of configuring Oracle, see The resource ORACLE option on page -27.
sasworkdisk

Syntax:

resource sasworkdisk "directory_path" [{pools "poolname"...}]

This option is used to specify the path to your SAS work directory. See The SAS Resources on page -28.

scratchdisk

Syntax:

resource scratchdisk "directory_path" [{pools "poolname"...}]

Assign to this option the quoted absolute path name of a directory on a file system where intermediate data will be temporarily stored. All Orchestrate users using this configuration must be able to read from and write to this directory. Relative path names are unsupported. The directory should be local to the processing node and reside on a different spindle from that of any other disk space. One node can have multiple scratch disks. Assign at least 500 MB of scratch disk space to each defined node. Nodes should have roughly equal scratch space. If you perform sorting operations, your scratch disk space requirements can be considerably greater, depending upon anticipated use. We recommend that:
- Every logical node in the configuration file that will run sorting operations have its own sort disk, where a sort disk is defined as a scratch disk available for sorting that resides in either the sort or default disk pool.
- Each logical node's sorting disk be a distinct disk drive. Alternatively, if it is shared among multiple sorting nodes, it should be striped to ensure better performance.
- For large sorting operations, each node that performs sorting have multiple distinct sort disks on distinct drives, or striped.

You can group scratch disks in pools. Indicate the pools to which a scratch disk belongs by following its quoted name with a pools definition enclosed in braces. For more information on disk pools, see Disk and Scratch Disk Pools and Their Defaults later in this chapter. The following sample SMP configuration file defines four logical nodes.
{ node "borodin0" { fastname "borodin" pools "compute_1" "" resource disk "/sfiles/node0" {pools ""} resource scratchdisk "/scratch0" {pools "" "sort"} } node "borodin1" { fastname "borodin" pools "compute_1" "" resource disk "/sfiles/node1" {pools ""} resource scratchdisk "/scratch1" {pools "" "sort"} }

Parallel Job Developers Guide

58-21

Configuration Files

The Parallel Engine Configuration File

node "borodin2" { fastname "borodin" pools "compute_1" "" resource disk "/sfiles/node2" {pools ""} resource scratchdisk "/scratch2" {pools "" "sort"} } node "borodin3" { fastname "borodin" pools "compute_1" "" resource disk "/sfiles/node3" {pools ""} resource scratchdisk "/scratch3" {pools "" "sort"} } }

In the example shown above:
- All nodes are elements of pool compute_1 and the default node pool, indicated by "".
- The resource disk of node borodin0 is the directory /sfiles/node0.
- The resource disks of nodes borodin1 to borodin3 are the directories /sfiles/node1, /sfiles/node2, and /sfiles/node3.
- All resource disks are elements of the default disk pool, indicated by "".
- For sorting, each logical node has its own scratch disk.
- All scratch disks are elements of the sort scratch disk pool and the default scratch disk pool, which is indicated by "".

Node Pools and the Default Node Pool


Node pools allow association of processing nodes based on their characteristics. For example, certain nodes can have large amounts of physical memory, and you can designate them as compute nodes. Others can connect directly to a mainframe or some form of high-speed I/O. These nodes can be grouped into an I/O node pool. The option pools is followed by the quoted names of the node pools to which the node belongs. A node can be assigned to multiple pools, as in the following example, where node1 is assigned to the default pool ("") as well as the pools node1, node1_css, and pool4.
node "node1" { fastname "node1_css" pools "" "node1" "node1_css" "pool4" resource disk "/orch/s0" {} resource scratchdisk "/scratch" {} }

A node belongs to the default pool unless you explicitly specify a pools list for it, and omit the default pool name ("") from the list.
Once you have defined a node pool, you can constrain a parallel stage or parallel job to run only on that pool, that is, only on the processing nodes belonging to it. If you constrain both a stage and a job, the stage runs only on the nodes that appear in both pools. Nodes or resources that name a pool declare their membership in that pool. We suggest that when you initially configure your system you place all nodes in pools that are named after the node's name and fast name. Additionally, include the default node pool in this list, as in the following example:
node "n1" { fastname "nfast" pools "" "n1" "nfast" }

By default, the parallel engine executes a parallel stage on all nodes defined in the default node pool. You can constrain the processing nodes used by the parallel engine either by removing node descriptions from the configuration file or by constraining a job or stage to a particular node pool.

Disk and Scratch Disk Pools and Their Defaults


When you define a processing node, you can specify the options resource disk and resource scratchdisk. They indicate the directories of file systems available to the node. You can also group disks and scratch disks in pools. Pools reserve storage for a particular use, such as holding very large data sets. The syntax for setting up disk and scratch disk pools is as follows:
resource disk "disk_name" {pools "disk_pool0" ... "disk_poolN"} resource scratchdisk "s_disk_name" {pools "s_pool0" ... "s_poolN"}

where:
- disk_name and s_disk_name are the names of directories.
- disk_pool... and s_pool... are the names of disk and scratch disk pools, respectively.

Pools defined by disk and scratchdisk are not combined; therefore, two pools that have the same name and belong to both resource disk and resource scratchdisk define two separate pools. A disk that does not specify a pool is assigned to the default pool. The default pool may also be identified by "" and by { } (the empty pool list). For example, the following code configures the disks for node1:

node "node1" { resource disk "/orch/s0" {pools "" "pool1"} resource disk "/orch/s1" {pools "" "pool1"} resource disk "/orch/s2" { } /* empty pool list */ resource disk "/orch/s3" {pools "pool2"} resource scratchdisk "/scratch"{pools "" "scratch_pool1"} }

In this example:
- The first two disks are assigned to the default pool.
- The first two disks are assigned to pool1.
- The third disk is also assigned to the default pool, indicated by { }.
- The fourth disk is assigned to pool2 and is not assigned to the default pool.
- The scratch disk is assigned to the default scratch disk pool and to scratch_pool1.

Application programmers make use of pools based on their knowledge of both their system and their application.

Buffer Scratch Disk Pools


Under certain circumstances, the parallel engine uses both memory and disk storage to buffer virtual data set records. The amount of memory defaults to 3 MB per buffer per processing node. The amount of disk space for each processing node defaults to the amount of available disk space specified in the default scratchdisk setting for the node. The parallel engine uses the default scratch disk for temporary storage other than buffering. If you define a buffer scratch disk pool for a node in the configuration file, the parallel engine uses that scratch disk pool rather than the default scratch disk for buffering, and all other scratch disk pools defined are used for temporary storage other than buffering. Here is an example configuration file that defines a buffer scratch disk pool:
{
  node "node1" {
    fastname "node1_css"
    pools "" "node1" "node1_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
  node "node2" {
    fastname "node2_css"
    pools "" "node2" "node2_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
}

In this example, each processing node has a single scratch disk resource in the buffer pool, so buffering will use /scratch0 but not /scratch1. However, if /scratch0 were not in the buffer pool, both /scratch0 and /scratch1 would be used because both would then be in the default pool.

The resource DB2 Option


The DB2 file db2nodes.cfg contains information for translating DB2 node numbers to node names. You must define the node names specified in db2nodes.cfg in your configuration file, if you want the parallel engine to communicate with DB2. You can designate each node specified in db2nodes.cfg in one of the following ways:
- By assigning to node its quoted network name, as returned by the UNIX operating system command uname -n; for example, node "node4".
- By assigning to node a logical name, for example "DB2Node3". If you do so, you must specify the option resource DB2 followed by the node number assigned to the node in db2nodes.cfg.

The resource DB2 option can also take the pools option. You assign to it the user name of the owner of each DB2 instance configured to run on each node. DB2 uses the instance to determine the location of db2nodes.cfg. Here is a sample DB2 configuration:
{ node "Db2Node0" { /* other configuration parameters for node0 */ resource DB2 "0" {pools "Mary" "Tom"} } node "Db2Node1" { /* other configuration parameters for node1 */ resource DB2 "1" {pools "Mary" "Tom"} } node "Db2Node2" { /* other configuration parameters for node2 */ resource DB2 "2" {pools "Mary" "Tom" "Bill"} } node "Db2Node3" { /* other configuration parameters for node3 */ resource DB2 "3" {pools "Mary" "Bill"}

Parallel Job Developers Guide

58-25

The resource INFORMIX Option

The Parallel Engine Configuration File

} /* other nodes used by the parallel engine*/ }

In the example above:
- The resource DB2 option takes the DB2 node number corresponding to the processing node.
- All nodes are used with the DB2 instance Mary.
- Nodes 0, 1, and 2 are used with the DB2 instance Tom.
- Nodes 2 and 3 are used with the DB2 instance Bill.

If you now specify a DB2 instance of Mary in your Orchestrate application, the location of db2nodes.cfg is ~Mary/sqllib/db2nodes.cfg.

The resource INFORMIX Option


To communicate with INFORMIX, the parallel engine must be configured to run on all processing nodes functioning as INFORMIX coservers. This means that the Orchestrate configuration must include a node definition for the coserver nodes. The list of INFORMIX coservers is contained in the file pointed to by the environment variable $INFORMIXSQLHOSTS or in the file $INFORMIXDIR/etc/sqlhosts. There are two methods for specifying the INFORMIX coserver names in the Orchestrate configuration file.
1 Your Orchestrate configuration file can contain a description of each node, supplying the node name (not a synonym) as the quoted name of the node. Typically, the node name is the network name of a processing node as returned by the UNIX command uname -n. Here is a sample configuration file for a system containing INFORMIX coserver nodes node0, node1, node2, and node3:

{ node "node0" { /* configuration parameters for node0 */ } node "node1" { /* configuration parameters for node1 */ } node "node2" { /* configuration parameters for node2 */ }

58-26

Parallel Job Developers Guide

The Parallel Engine Configuration File

The resource ORACLE option

node "node3" { /* configuration parameters for node3 */ } /* other nodes used by the parallel engine*/ }

2 You can supply a logical rather than a real network name as the quoted name of node. If you do so, you must specify the resource INFORMIX option followed by the name of the corresponding INFORMIX coserver. Here is a sample INFORMIX configuration:

{ node "IFXNode0" { /* other configuration parameters for node0 resource INFORMIX "node0" {pools "server"} } node "IFXNode1" { /* other configuration parameters for node1 resource INFORMIX "node1" {pools "server"} } node "IFXNode2" { /* other configuration parameters for node2 resource INFORMIX "node2" {pools "server"} } node "IFXNode3" { /* other configuration parameters for node3 resource INFORMIX "node3" {pools "server"} } /* other nodes used by the parallel engine*/ } */

*/

*/

*/

When you specify resource INFORMIX, you must also specify the pools parameter. It indicates the base name of the coserver groups for each INFORMIX server. These names must correspond to the coserver group base name using the shared-memory protocol. They also typically correspond to the DBSERVERNAME setting in the ONCONFIG file. For example, coservers in the group server are typically named server.1, server.2, and so on.

The resource ORACLE option


By default, the parallel engine executes Oracle stages on all processing nodes belonging to the default node pool, which typically corresponds to all defined nodes. You can optionally specify the resource ORACLE option to define the nodes on which you want to run the Oracle stages. If you do, Orchestrate runs the Oracle stages only on the processing nodes for which resource ORACLE is defined. You can additionally specify the pools parameter of resource ORACLE to define resource pools, which are groupings of Oracle nodes. Here is a sample Oracle configuration:
{ node "node0" { /* other configuration parameters for node0 */ resource ORACLE "node0" {pools "group1" "group2" "group3"} } node "node1" { /* other configuration parameters for node1 */ resource ORACLE "node1" {pools "group1" "group2"} } node "node2" { /* other configuration parameters for node2 */ resource ORACLE "node2" {pools "group1" "group3"} } node "node3" { /* other configuration parameters for node3 */ resource ORACLE "node3" {pools "group1" "group2" } /* any other nodes used by the parallel engine*/ }

"group3"}

In the example above, Oracle runs on node0 to node3:
- node0 to node3 are used with node pool group1.
- node0, node1, and node3 are used with node pool group2.
- node0, node2, and node3 are used with node pool group3.

The SAS Resources


Adding SAS Information to your Configuration File
To configure your system to use the SAS stage, you need to specify the following information in your configuration file:
- The location of the SAS executable, if it is not in your PATH.
- An SAS work disk directory, one for each parallel engine node.
- Optionally, a disk pool specifically for parallel SAS data sets, called sasdataset.

The resource names sas and sasworkdisk and the disk pool name sasdataset are all reserved words. Here is an example of each of these declarations:

resource sas "/usr/sas612/" { } resource sasworkdisk "/usr/sas/work/" { } resource disk "/data/sas/" {pools "" "sasdataset"}

While the disks designated as sasworkdisk need not be a RAID configuration, best performance will result if each parallel engine logical node has its own reserved disk that is not shared with other parallel engine nodes during sorting and merging. The total size of this space for all nodes should optimally be equal to the total work space you use when running SAS sequentially (or a bit more, to allow for uneven distribution of data across partitions). The number of disks in the sasdataset disk pool is the degree of parallelism of parallel SAS data sets. Thus if you have 24 processing nodes, each with its associated disk in the sasdataset disk pool, parallel SAS data sets will be partitioned among all 24 disks, even if the operation preceding the disk write is, for example, only four-way parallel.

Example
Here a single node, grappelli0, is defined, along with its fast name. Also defined are the path to a SAS executable, a SAS work disk (corresponding to the SAS work directory), and two disk resources, one for parallel SAS data sets and one for non-SAS file sets.
node "grappelli0" { fastname "grappelli" pools "" "a" resource sas "/usr/sas612" { } resource scratchdisk "/scratch" { } resource sasworkdisk "/scratch" { } disk "/data/pds_files/node0" { pools "" "export" } disk "/data/pds_files/sas" { pools "" "sasdataset" } }

Sort Configuration
You may want to define a sort scratch disk pool to assign scratch disk space explicitly for the storage of temporary files created by the Sort stage. In addition, if only a subset of the nodes in your configuration have sort scratch disks defined, we recommend that you define a sort node pool, to specify the nodes on which the sort stage should run. Nodes assigned to the sort node pool should be those that have scratch disk space assigned to the sort scratch disk pool. The parallel engine then runs sort only on the nodes in the sort node pool, if it is defined, and otherwise uses the default node pool. The
Sort stage stores temporary files only on the scratch disks included in the sort scratch disk pool, if any are defined, and otherwise uses the default scratch disk pool. When the parallel engine runs, it determines the locations of temporary files by:
1 Searching the parallel engine configuration for any scratch disk resources in the sort resource pool on the nodes sort will run on. If found, the scratch disks are used as a location for temporary storage by sort.
2 If no scratch disk resources are found that belong to the disk pool sort, the system determines whether any scratch disk resources belong to the default scratch disk pool on the nodes sort will run on. If so, the scratch disks belonging to the default pool are used by tsort for temporary storage.
3 If no scratch disk resources are found that belong to either sort or the default scratch disk pool, the parallel engine issues a warning message and runs sort using the directory indicated by the TMPDIR environment variable or /tmp for temporary storage.
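As a sketch of the sort-specific setup described above (the node names and paths are illustrative only), a configuration can place selected nodes in a sort node pool and give each of them a scratch disk in the sort scratch disk pool:

node "node2" {
  fastname "node2_css"
  pools "" "node2" "sort"
  resource disk "/orch/s0" {}
  resource scratchdisk "/sort_scratch2" {pools "sort"}
  resource scratchdisk "/scratch2" {}
}
node "node3" {
  fastname "node3_css"
  pools "" "node3" "sort"
  resource disk "/orch/s0" {}
  resource scratchdisk "/sort_scratch3" {pools "sort"}
  resource scratchdisk "/scratch3" {}
}

With this arrangement the Sort stage runs only on node2 and node3 and writes its temporary files only to the scratch disks in the sort scratch disk pool.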

Allocation of Resources
The allocation of resources for a given stage, particularly node and disk allocation, is done in a multi-phase process. Constraints on which nodes and disk resources are used are taken from the parallel engine arguments, if any, and matched against any pools defined in the configuration file. Additional constraints may be imposed by, for example, an explicit requirement for the same degree of parallelism as the previous stage. After all relevant constraints have been applied, the stage allocates resources, including instantiation of Player processes on the nodes that are still available and allocation of disks to be used for temporary and permanent storage of data.

Selective Configuration with Startup Scripts


As part of running an application, the parallel engine creates a remote shell on all parallel engine processing nodes on which the application will be executed. After the parallel engine creates the remote shell, it copies the environment from the system on which the application was invoked to each remote shell. This means that all remote shells have the same configuration by default.

However, you can override the default and set configuration parameters for individual processing nodes. To do so, you create a parallel engine startup script. If a startup script exists, the parallel engine runs it on all remote shells before it runs your application. When you invoke an application, the parallel engine looks for the name and location of a startup script as follows:
1 It uses the value of the APT_STARTUP_SCRIPT environment variable.
2 It searches the current working directory for a file named startup.apt.
3 It searches for the file install_dir/etc/startup.apt on the system that invoked the parallel engine application, where install_dir is the top-level directory of the installation.

If the script is not found, it does not execute a startup script.

Here is a template you can use with Korn shell to write your own startup script.
#!/bin/ksh               # specify Korn shell
# your shell commands go here
shift 2                  # required for all shells
exec $*                  # required for all shells

You must include the last two lines of the shell script. This prevents your application from running if your shell script detects an error. The following startup script for the Bourne shell prints the node name, time, and date for all processing nodes before your application is run:
#!/bin/sh                # specify Bourne shell
echo `hostname`
date
shift 2
exec $*

A single script can perform node-specific initialization by means of a case statement. In the following example, the system has two nodes named node1 and node2. This script performs initialization based on which node it is running on.
#!/bin/sh                # use Bourne shell
# Example APT startup script.
case `hostname` in
node1)
    # perform node1 init
    node-specific directives
    ;;
node2)
    # perform node2 init
    node-specific directives
    ;;
esac
shift 2
exec $*

The parallel engine provides the APT_NO_STARTUP_SCRIPT environment variable to prevent the parallel engine from running the startup script. By default, the parallel engine executes the startup script. If the variable is set, the parallel engine ignores the startup script. This can be useful for debugging a startup script.
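For example (the path and value shown are illustrative), you could point the engine at a project-specific startup script, or temporarily set APT_NO_STARTUP_SCRIPT while debugging so that no startup script runs:

export APT_STARTUP_SCRIPT=/u1/dsadm/etc/startup.apt
export APT_NO_STARTUP_SCRIPT=1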

Hints and Tips


The configuration file tells the engine how to exploit the underlying computer system. For a given system there is not necessarily one ideal configuration file because of the high variability between the way different jobs work. So where do you start?

Let's assume you are running on a shared-memory multi-processor system, i.e., an SMP box (these are the most common platforms today). Let's assume these properties. You can adjust the illustration below to match your precise situation:
- computer's hostname "fastone"
- 6 CPUs
- 4 separate file systems on 4 drives named /fs0, /fs1, /fs2, /fs3

The configuration file to use as a starting point would look like the one below. Note the way the disk/scratchdisk resources are handled. That's the real trick here.
{ /* config file allows C-style comments. */
  /* config files look like they have flexible syntax.
     They do NOT. Keep all the sub-items of the individual
     node specifications in the order shown here. */
  node "n0" {
        pools ""     /* on an SMP node pools aren't used often. */
        fastname "fastone"
        resource scratchdisk "/fs0/ds/scratch" {}   /* start with fs0 */
        resource scratchdisk "/fs1/ds/scratch" {}
        resource scratchdisk "/fs2/ds/scratch" {}
        resource scratchdisk "/fs3/ds/scratch" {}
        resource disk "/fs0/ds/disk" {}             /* start with fs0 */
        resource disk "/fs1/ds/disk" {}
        resource disk "/fs2/ds/disk" {}
        resource disk "/fs3/ds/disk" {}
  }


node "n1" {pools "" fastname "fastone" resource scratchdisk "/fs1/ds/scratch" {} /*start with fs1*/ resource scratchdisk "/fs2/ds/scratch" {} resource scratchdisk "/fs3/ds/scratch" {} resource scratchdisk "/fs0/ds/scratch" {} resource disk "/fs1/ds/disk" {} /* start with fs1 */ resource disk "/fs2/ds/disk" {} resource disk "/fs3/ds/disk" {} resource disk "/fs0/ds/disk" {} } node "n2" { pools "" fastname "fastone" resource scratchdisk "/fs2/ds/scratch" {} /*start with fs2*/ resource scratchdisk "/fs3/ds/scratch" {} resource scratchdisk "/fs0/ds/scratch" {} resource scratchdisk "/fs1/ds/scratch" {} resource disk "/fs2/ds/disk" {} /* start with fs2 */ resource disk "/fs3/ds/disk" {} resource disk "/fs0/ds/disk" {} resource disk "/fs1/ds/disk" {} } node "n3" { pools "" fastname "fastone" resource scratchdisk "/fs3/ds/scratch" {} /*start with fs3*/ resource scratchdisk "/fs0/ds/scratch" {} resource scratchdisk "/fs1/ds/scratch" {} resource scratchdisk "/fs2/ds/scratch" {} resource disk "/fs3/ds/disk" {} /* start with fs3 */ resource disk "/fs0/ds/disk" {} resource disk "/fs1/ds/disk" {} resource disk "/fs2/ds/disk" {} } node "n4" { pools "" fastname "fastone" /* * Ok, now what. We rotated through starting with a * different disk, but we have a basic problem here which is * that there are more CPUs than disks. So what do we do * now? The answer: something that is not perfect. We're * going to repeat the sequence. You could shuffle * differently i.e., use /fs0 /fs2 /fs1 /fs3 as an order. * I'm not sure it will matter all that much. */ resource scratchdisk "/fs0/ds/scratch" {} /*start with fs0 again*/ resource scratchdisk "/fs1/ds/scratch" {} resource scratchdisk "/fs2/ds/scratch" {} resource scratchdisk "/fs3/ds/scratch" {} resource disk "/fs0/ds/disk" {} /* start with fs0 again */ resource disk "/fs1/ds/disk" {} resource disk "/fs2/ds/disk" {} resource disk "/fs3/ds/disk" {} }


node "n5" { pools "" fastname "fastone" resource scratchdisk "/fs1/ds/scratch" {} /*start with fs1*/ resource scratchdisk "/fs2/ds/scratch" {} resource scratchdisk "/fs3/ds/scratch" {} resource scratchdisk "/fs0/ds/scratch" {} resource disk "/fs1/ds/disk" {} /* start with fs1 */ resource disk "/fs2/ds/disk" {} resource disk "/fs3/ds/disk" {} resource disk "/fs0/ds/disk" {} }

} /* end of whole config */

The above config file pattern could be called "give everyone all the disk". This configuration style works well when the flow is complex enough that you can't really figure out and precisely plan for good I/O utilization. Giving every partition (node) access to all the I/O resources can cause contention, but the parallel engine tends to use fairly large blocks for I/O, so the contention isn't as much of a problem as you might think. This configuration style works for any number of CPUs and any number of disks since it doesn't require any particular correspondence between them. The heuristic principle at work here is this: "When it's too difficult to figure out precisely, at least go for achieving balance."
The alternative to the above configuration style is more careful planning of the I/O behavior so as to reduce contention. You can imagine this could be hard given our hypothetical 6-way SMP with 4 disks, because setting up the obvious one-to-one correspondence doesn't work. Doubling up some nodes on the same disk is unlikely to be good for overall performance, since it creates a hotspot. We could give every CPU 2 disks and rotate around, but that would be little different from the strategy above. So, let's imagine a less constrained environment and give ourselves 2 more disks, /fs4 and /fs5. Now a config file might look like this:
{ node "n0" { pools "" fastname "fastone" resource scratchdisk "/fs0/ds/scratch" {} resource disk "/fs0/ds/disk" {} } node "n1" { pools "" fastname "fastone" resource scratchdisk "/fs1/ds/scratch" {} resource disk "/fs1/ds/disk" {} }

58-34

Parallel Job Developers Guide

The Parallel Engine Configuration File

Hints and Tips

node "n2" { pools "" fastname "fastone" resource scratchdisk "/fs2/ds/scratch" resource disk "/fs2/ds/disk" {} } node "n3" { pools "" fastname "fastone" resource scratchdisk "/fs3/ds/scratch" resource disk "/fs3/ds/disk" {} } node "n4" { pools "" fastname "fastone" resource scratchdisk "/fs4/ds/scratch" resource disk "/fs4/ds/disk" {} } node "n5" { pools "" fastname "fastone" resource scratchdisk "/fs5/ds/scratch" resource disk "/fs5/ds/disk" {} } } /* end of whole config */

{}

{}

{}

{}

This is simplest, but realize that no single player (stage/operator instance) on any one partition can go faster than the single disk it has access to. You could combine strategies by adding in a node pool where disks have this one-to-one association with nodes. These nodes would then not be in the default node pool, but in a special one that you would assign stages/operators to specifically.
Other configuration file hints:
Consider avoiding the disk or disks that your input files reside on. Often those disks will be hotspots until the input phase is over. If the job is large and complex this is less of an issue, since the input part is proportionally less of the total work.
Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles, even if they're located on a RAID system.
Know what is real and what is NFS: real disks are directly attached, or are reachable over a SAN (a storage-area network dedicated just to storage, using low-level protocols).

Never use NFS file systems for scratchdisk resources. If you use NFS file system space for disk resources, then you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that doesn't mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to set up a "final" disk pool and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS.
Know what data points are striped (RAID) and which are not. Where possible, avoid striping across data points that are already striped at the spindle level.


59
SQL Builder
The SQL Builder provides a graphical interface that helps you build SQL SELECT statements. These statements allow you to select rows of data from your relational database. The statement can be a simple one that selects rows from a single table, or it can be complex, performing joins between multiple tables or aggregations of values within columns.
In Parallel jobs you can invoke the SQL Builder from:
DB2/UDB Enterprise stage
Oracle Enterprise stage
Different databases have slightly different SQL syntax (particularly when it comes to more complex operations such as joins). The exact form of the SQL statements that the SQL Builder produces depends on which stage you invoked it from. The examples given here are based on the Oracle Enterprise stage.
You do not have to be an SQL expert to use the SQL Builder, but we assume some familiarity with the basic structure of an SQL query in this documentation.

How to Use the SQL Builder


You reach the SQL Builder through the stage editors:
1 Choose a Read Method property of SQL Builder generated SQL in the Output page Properties tab.
2 Select the SQL Query property; the SQL Query window will initially be blank.


3 Choose Build New Query Syntax from the Right-Arrow menu (where Syntax indicates the version of the database you are querying). The SQL Builder opens. When you have constructed the required query, it appears in the SQL Query window on the stage editor, and from here you can edit it, if required.

Note If you have previously built a query using a Read Method of Auto-generated SQL or User-defined SQL, this will be lost if you then select a Read Method of SQL builder generated SQL.

If you have constructed a query using the SQL Builder, it will be retained if you switch to a Read Method of User-defined SQL, but lost if you choose Auto-generated SQL.

How to Build Queries with the SQL Builder


This section describes the general steps you have to take when using the SQL Builder to construct a query. The examples on page 59-29 give you more detailed instructions for building different types of query. To use the SQL Builder:


1 Ensure that the SQL Builder has the Selection tab on top.
2 Drag any tables you want to include in your query from the repository window to the canvas (you must have previously placed the table definitions in the DataStage repository; the easiest way to do this is to import the definitions directly from your relational database). You can drag multiple tables onto the canvas to enable you to specify complex queries such as joins.
3 Specify the columns that you want to select from the table or tables on the column selection grid.
4 If you want to refine the selection you are performing, choose a predicate from the Predicate list in the filter panel. Then use the expression editor to specify the actual filter (the fields displayed depend on the predicate you choose). For example, use the Comparison predicate to specify that a column should match a particular value, or the Between predicate to specify that a column falls within a particular range. The filter appears as a WHERE clause in the finished query.
5 Click the Add button in the filter panel. The filter that you specify appears in the filter expression panel and is added to the SQL statement that you are building.
6 If you are joining multiple tables, and the automatic joins inserted by the SQL Builder are not what's required, manually alter the joins.


7 If you want to group your results according to the values in certain columns, select the Group tab. Select the Grouping check box in the column grouping and aggregation grid for the column or columns that you want to group the results by.
8 If you want to aggregate the values in the columns, you should also select the Group tab. Select the aggregation that you want to perform on a column from the Aggregation drop-down list in the column grouping and aggregation grid.


9 Click on the Sql tab to view the finished query, and to resolve the columns generated by the SQL statement with the columns loaded on the stage (if necessary).

Selection Tab
When the SQL Builder opens, it has the Selection tab on top (see page 59-3). This has the components described in the following sections.

Toolbar
The toolbar contains various tools:

clear: click this to completely clear the query you are currently building.
cut: allows certain items to be removed and placed on the clipboard so they can be pasted elsewhere.


copy: allows certain items to be copied and placed on the clipboard so they can be pasted elsewhere.
paste: allows you to paste items from the clipboard to certain places in the SQL Builder.
SQL properties: opens the Properties dialog box.
quoting: toggles between having table and column names in quotation marks in the SQL statement and having them unquoted.
validation: toggles the validation feature on and off.
view data: available when you invoke the SQL Builder from stages that support the viewing of data. It causes the calling stage to run the SQL as currently built and return the results for you to view.
refresh: refreshes the contents of all the panels on the SQL Builder.
window view: allows you to select which items are shown in the SQL Builder.
help: opens the online help.

Repository Window
This displays the table definitions that currently exist within the DataStage repository. The easiest way to get a table definition into the repository is to import it directly from the database you want to query; you can do this via the DataStage Designer or DataStage Manager, or you can do it directly from the shortcut menu in the repository tree. You can also manually define a table definition from within the SQL Builder by selecting New Table... from the repository window shortcut menu.
To select a table to query, select it in the repository window and drag it to the table selection canvas. A window appears in the canvas representing the table and listing all its individual columns.
A shortcut menu allows you to:
Refresh the repository view
Define a new table definition (the Table Definition dialog box opens)
Import Meta Data directly from a data source (a sub menu offers a list of source types)
Copy a table definition (you can paste it in the table selection canvas)


View the properties of the table definition (the Table Definition dialog box opens)
You can also view the properties of a table definition by double-clicking on it in the repository window.

Table Selection Canvas


The canvas allows you to define the tables that are used in the query. Drag a table from the repository window, and it will appear as a window on the canvas, listing all the columns in the table and their types. (If the desired table does not exist in the repository, you can import it from the database you are querying by choosing Import Meta Data from the repository window shortcut menu.)
Wherever you try to place the table on the canvas, the first table you drag will always be placed in the top left hand corner. Subsequent tables can be dragged before or after the initial table, or on a new row underneath. Eligible areas are highlighted on the canvas as you drag the table, and you can only drop a table in one of the highlighted areas. When you place tables on the same row, the SQL Builder will automatically join the tables (you can alter the join if it's not what you want). When you place tables on a separate row, no join is added and you will get a cartesian product (otherwise known as a 'cross-join') of the tables on the different rows, unless you explicitly join the tables. For details about joining tables, see "Joining Tables" on page 59-23.
Click the Select All button underneath the table title bar to select all the columns in the table.
With a table selected in the canvas, a shortcut menu allows you to:
Add a related table. A submenu shows you tables that have a foreign key relationship with the currently selected one. Select a table to insert it in the canvas, together with the type of join inferred by the foreign key relationship.
Remove the selected table.
Select all the columns in the table (so that you could, for example, drag them all to the column selection grid).
Open a Select Table dialog box to allow you to bind an alternative table definition in the repository to the currently selected table.
Open the Table Properties dialog box for the currently selected table.
With a join selected in the canvas, a shortcut menu allows you to:


Open the Alternate Relation dialog box to specify that the join should be based on a different foreign key relationship.
Open the Join Properties dialog box.
From the canvas background, a shortcut menu allows you to:
Refresh the view of the table selection canvas.
Paste a table that you have copied from the repository window.
View data: available when you invoke the SQL Builder from stages that support the viewing of data. It causes the calling stage to run the SQL as currently built and return the results for you to view.
Open the Properties dialog box to view details of the SQL syntax that the SQL Builder is currently building a query for.

Column Selection Grid


This is where you specify which columns are to be included in your query. You can populate the grid in any of the following ways:
drag columns from the tables in the table selection canvas
choose columns from a drop-down list in the grid
copy and paste from the table selection canvas
The grid has the following fields:

Column expression
Identifies the column to be included in the query. You can specify one of the following in identifying a column:
Job parameter. A dialog box appears offering you a choice of available job parameters. This allows you to specify the value to be used in the query at run time (the stage you are using the SQL Builder from must allow job parameters for this to appear).
Expression. An expression editor dialog box appears, allowing you to specify an expression that represents the value to be used in the query.
Data flow variable. A dialog box appears offering you a choice of available data flow variables (the stage you are using the SQL Builder from must support data flow variables for this to appear).
Column. You can directly select a column from one of the tables in the table selection canvas.


Table
Identifies the table that the column belongs to. If you populate the column grid by dragging a column from the table selection canvas, the table is filled in automatically. Otherwise, choose a table from the drop-down list, or choose a job parameter to enable you to specify the table name at run time.

Column Alias
This allows you to specify an alias for the column.

Output
This is selected to indicate that the column will be output by the query. This is automatically selected when you add a column to the grid.

Sort
Choose Ascending or Descending to have the query sort the returned rows by the value of this column. Selecting to sort results in an ORDER BY clause being added to the query.

Sort Order
Allows you to specify the order in which rows are sorted if you are ordering by more than one column.
A shortcut menu allows you to:
Paste a column that you've copied from the table selection canvas
Refresh the view of the grid
Show or hide the filter panel
Remove a row from the grid

Filter Panel
The filter panel allows you to specify a WHERE clause for the SELECT statement you are building. It comprises a predicate list and an expression editor panel, the contents of which depend on the chosen predicate. See "Expression Editor" on page 59-14 for details on using the expression editor that the filter panel provides.


Filter Expression Panel


This panel displays any filters that you have added to the query being built. You can edit the filter manually in this panel. Alternatively you can type a filter straight in, without using the filter expression editor.

Group Tab
The Group tab (see page 59-4) allows you to specify that the results of a query are grouped by a column, or columns. Also, it allows you to aggregate the results in some of the columns, for example, you could specify COUNT to count the number of rows that contain a not-null value in a column. The Group tab gives access to the toolbar (see page 59-5), repository window (see page 59-6), and the table selection canvas (see page 59-7) in exactly the same way as the Selection tab. Other components are described in the following sections.

Grouping Grid
This is where you specify which columns are to be grouped by or aggregated on. The grid is populated with the columns that you selected on the Selection tab, although you can change the selected columns or select new ones, which will be reflected in the selection your query makes. The grid has the following fields:
Column expression. Identifies the column to be included in the query. If you want to change the column selections that were made on the Selection tab, you can specify one of the following in identifying a column:

Job parameter. A dialog box appears offering you a choice of available job parameters. This allows you to specify the value to be used in the query at run time (the stage you are using the SQL Builder from must allow job parameters for this to appear).
Expression. An expression editor dialog box appears, allowing you to specify an expression that represents the value to be used in the query.
Data flow variable. A dialog box appears offering you a choice of available data flow variables (the stage you are using the SQL Builder from must support data flow variables for this to appear).


Column. You can directly select a column from one of the tables in the table selection canvas.

Column Alias. This allows you to specify an alias for the column. If you select an aggregation operation for a column, SQL Builder will automatically insert an alias of the form alias_n; you can edit this if required.
Output. This is selected to indicate that the column will be output by the query. This is automatically selected when you add a column to the grid.
Distinct. Select this check box if you want to add the DISTINCT qualifier to an aggregation. For example, a COUNT aggregation with the distinct qualifier will count the number of rows with distinct values in a field (as opposed to just the not-null values). (Not all SQL syntaxes support the DISTINCT feature; see "SQL Properties Dialog Box" on page 59-29.)
Aggregation. Allows you to select an aggregation function to apply to the column (note that this is mutually exclusive with the Group by option). See "Aggregation Functions" on page 59-11 for details about the available functions.
Group By. Select the check box to specify that query results should be grouped by the results in this column.

Aggregation Functions
The aggregation functions available vary according to the stage you have opened the SQL Builder from. The following are the basic ones supported by all SQL syntax variants:
AVG. This returns the mean average of the values in a column. For example, if you had six rows with a column containing a price, the six values would be added together and divided by six to yield the mean average. If you specify the DISTINCT qualifier, only distinct values will be averaged; if our six rows only contained four distinct prices then these four would be added together and divided by four to produce a mean average.
COUNT. This counts the number of rows that contain a not-null value in a column. If you specify the DISTINCT qualifier, only distinct values will be counted.
MAX. This returns the maximum value that the rows hold in a particular column. (The DISTINCT qualifier can be selected, but has no effect on this function.)
MIN. This returns the minimum value that the rows hold in a particular column. (The DISTINCT qualifier can be selected, but has no effect on this function.)


The SQL Builder offers additional aggregation functions according to what is supported by the database you are building the query for.
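For illustration only (this is a sketch rather than captured SQL Builder output, using the accounts table from the examples later in this chapter; the customer and balance column names are assumed here), grouping by account type and applying COUNT with the DISTINCT qualifier and AVG would produce a statement of this general form:

SELECT accountType, COUNT(DISTINCT customer), AVG(balance)
FROM accounts
GROUP BY accountType

The exact statement produced depends on the stage and the SQL syntax in use.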

Filter Panel
The filter panel allows you to specify a HAVING clause for the SELECT statement you are building. It comprises a predicate list and an expression editor panel, the contents of which depend on the chosen predicate. See "Expression Editor" on page 59-14 for details on using the expression editor that the filter panel provides.

Filter Expression Panel


This panel displays any filters that you have added to the query being built. You can edit the filter manually in this panel. Alternatively you can type a filter straight in, without using the filter panel.

Sql Tab
Visit the Sql tab (see page 59-5) at any time to view the query being built. This tab displays the SQL statement as it currently stands. You cannot edit the statement here, but a shortcut menu allows you to copy and paste text. If the columns you have defined as output columns for your stage do not match the columns the SQL statement is generating, you can use the Resolve columns grid to reconcile them (in most cases they would match).

Resolve Columns Grid


If the columns you have loaded onto your stage editor (the loaded columns) do not match the columns generated by the SQL statement (the result columns) you have defined, the Resolve columns grid gives you the opportunity to reconcile them. Ideally the columns should match (and in normal circumstances usually would). A mismatch would cause the meta data in your job to become out of step with the meta data as loaded from your source database (which could cause a problem if you are performing usage analysis based on that table). If there is a mismatch, the grid displays a warning message. Click the Auto Match button to resolve the mismatch. You are offered the choice of matching by name, by order, or by both. When matching, the SQL


builder seeks to alter the columns generated by the SQL statement to match the columns loaded onto the stage.
If you choose Name matching, and a column of the same name with a compatible data type is found, the SQL builder:
Moves the result column to the equivalent position in the grid to the loaded column (this will change the position of the named column in the SQL).
Modifies all the attributes of the result column to match those of the loaded column.
If you choose Order matching, the builder works through comparing each results column to the loaded column in the equivalent position. If a mismatch is found, and the data type of the two columns is compatible, the SQL builder:
Changes the name of the result column to match the loaded column (provided the results set does not already include a column of that name).
Modifies all the attributes of the result column to match those of the loaded column.
If you choose Both, the SQL Builder applies Name matching and then Order matching.
If auto matching fails to reconcile the columns as described above, any mismatched results column that represents a single column in a table is overwritten with the details of the loaded column in the equivalent position.
When you click OK in the SQL tab, the SQL builder checks to see if the results columns match the loaded columns. If they don't, a warning message is displayed allowing you to proceed or cancel. Proceeding causes the loaded columns to be merged with the results columns:
Any matched columns are not affected.
Any extra columns in the results columns are added to the loaded columns.
Any columns in the loaded set that do not appear in the results set are removed.
For columns that don't match, if data types are compatible the loaded column is overwritten with the results column. If data types are not compatible, the existing loaded column is removed and replaced with the results column.
You can also edit the columns in the Results part of the grid in order to reconcile mismatches manually.


Expression Editor
The Expression Editor allows you to specify details of a WHERE clause that will be inserted in your query, or of a join condition where you are joining multiple tables, or of a HAVING clause. A variant of the expression editor allows you to specify a calculation or a function within a function (see "Calculation/Function Expression Editor" on page 59-20). The Expression Editor can be opened from various places in the SQL Builder.

Main Expression Editor


To specify an expression:
1 Choose the type of filter by choosing a predicate from the list.
2 Fill in the information required by the Expression Editor fields that appear.
3 Click the Add button to add the filter to the query you are building. This clears the expression editor so that you can add another filter if required.
The contents of the expression editor vary according to which predicate you have selected. The following predicates are available:
Between. Allows you to specify that the value in a column should lie within a certain range.
Comparison. Allows you to specify that the value in a column should be equal to, or greater than or less than, a certain value.
In. Allows you to specify that the value in a column should match one of a list of values.
Like. Allows you to specify that the value in a column should contain, start with, end with, or match a certain value.
Null. Allows you to specify that a column should, or should not, be null.
If you are building an Oracle 8i query, an additional predicate is available:
Join. Allows you to specify a join. This appears in the query as a WHERE statement (Oracle 8i does not support JOIN statements).


Between
The expression editor when you have selected the Between predicate is as follows:

The fields it contains are:
Column. Choose the column on which you are filtering from the drop-down list. You can specify one of the following in identifying a column:
Job parameter. A dialog box appears offering you a choice of available job parameters. This allows you to specify the value to be used in the query at run time (the stage you are using the SQL Builder from must allow job parameters for this to appear).
Expression. An expression editor dialog box appears, allowing you to specify an expression that represents the value to be used in the query.
Data flow variable. A dialog box appears offering you a choice of available data flow variables (the stage you are using the SQL Builder from must support data flow variables for this to appear).
Column. You can directly select a column from one of the tables in the table selection canvas.
Between/Not Between. Choose Between or Not Between from the drop-down list to specify whether the value you are testing should be inside or outside your specified range.
Start of range. Use this field to specify the start of your range. Click the menu button to the right of the field and specify details about the argument you are using to specify the start of the range, then specify the value itself in the field.
End of range. Use this field to specify the end of your range. Click the menu button to the right of the field and specify details about the argument you are using to specify the end of the range, then specify the value itself in the field.
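For example, a Between filter on a balance column (a sketch using the accounts table from the examples later in this chapter, with assumed range values; not captured SQL Builder output) would be added to the query as a WHERE clause of this general form:

SELECT customer, balance
FROM accounts
WHERE balance BETWEEN 500 AND 1000

Choosing Not Between would produce WHERE balance NOT BETWEEN 500 AND 1000 instead.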


Comparison
The expression editor when you have selected the Comparison predicate is as follows:

The fields it contains are:
Column. Choose the column on which you are filtering from the drop-down list. You can specify one of the following in identifying a column:
Job parameter. A dialog box appears offering you a choice of available job parameters. This allows you to specify the value to be used in the query at run time (the stage you are using the SQL Builder from must allow job parameters for this to appear).
Expression. An expression editor dialog box appears, allowing you to specify an expression that represents the value to be used in the query.
Data flow variable. A dialog box appears offering you a choice of available data flow variables (the stage you are using the SQL Builder from must support data flow variables for this to appear).
Column. You can directly select a column from one of the tables in the table selection canvas.
Comparison operator. Choose the comparison operator from the drop-down list. The available operators are:
=   equals
<>  not equal to
<   less than
<=  less than or equal to
>   greater than
>=  greater than or equal to
Comparison value. Use this field to specify the value you are comparing to. Click the menu button to the right of the field and choose the data type for the value from the menu, then specify the value itself in the field.


In
The expression editor when you have selected the In predicate is as follows:

The fields it contains are:
Column. Choose the column on which you are filtering from the drop-down list. You can specify one of the following in identifying a column:
Job parameter. A dialog box appears offering you a choice of available job parameters. This allows you to specify the value to be used in the query at run time (the stage you are using the SQL Builder from must allow job parameters for this to appear).
Expression. An expression editor dialog box appears, allowing you to specify an expression that represents the value to be used in the query.
Data flow variable. A dialog box appears offering you a choice of available data flow variables (the stage you are using the SQL Builder from must support data flow variables for this to appear).
Column. You can directly select a column from one of the tables in the table selection canvas.
In/Not In. Choose IN or NOT IN from the drop-down list to specify whether the value should be in the specified list or not in it.
Selection. These fields allow you to specify the list used by the query. Use the menu button to the right of the single field to specify details about the argument you are using to specify a list item, then enter a value. Click the double right arrow to add the value to the list. To remove an item from the list, select it then click the double left arrow.
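For example, an In filter built against the accounts table (the list values shown here are assumed for illustration; the SQL Builder generates the list from the items you add with the double right arrow) appears in the query in this general form:

SELECT customer, accountType
FROM accounts
WHERE accountType IN ('Savings', 'Checking', 'Loan')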


Like
The expression editor when you have selected the Like predicate is as follows:

The fields it contains are:
Column. Choose the column on which you are filtering from the drop-down list. You can specify one of the following in identifying a column:
Job parameter. A dialog box appears offering you a choice of available job parameters. This allows you to specify the value to be used in the query at run time (the stage you are using the SQL Builder from must allow job parameters for this to appear).
Expression. An expression editor dialog box appears, allowing you to specify an expression that represents the value to be used in the query.
Data flow variable. A dialog box appears offering you a choice of available data flow variables (the stage you are using the SQL Builder from must support data flow variables for this to appear).
Column. You can directly select a column from one of the tables in the table selection canvas.
Like/Not Like. Choose LIKE or NOT LIKE from the drop-down list to specify whether you are including or excluding a value in your comparison.
Like Operator. Choose the type of Like or Not Like comparison you want to perform from the drop-down list. Available operators are:
Match Exactly. Your query will ask for an exact match to the value you specify.
Starts With. Your query will match rows that start with the value you specify.
Ends With. Your query will match rows that end with the value you specify.
Contains. Your query will match rows that contain the value you specify anywhere within them.


Like Value. Specify the value that your LIKE predicate will attempt to match.
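The Like operators typically translate into SQL wildcard patterns. The following sketch (using an assumed customer column and the % wildcard common to most SQL dialects; the exact pattern characters the SQL Builder generates depend on the database) shows how each operator might appear in the WHERE clause:

WHERE customer LIKE 'Smith'       -- Match Exactly
WHERE customer LIKE 'Smith%'      -- Starts With
WHERE customer LIKE '%Smith'      -- Ends With
WHERE customer LIKE '%Smith%'     -- Contains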

Null
The expression editor when you have selected the Null predicate is as follows:

The fields it contains are:
Column. Choose the column on which you are filtering from the drop-down list. You can specify one of the following in identifying a column:
Job parameter. A dialog box appears offering you a choice of available job parameters. This allows you to specify the value to be used in the query at run time (the stage you are using the SQL Builder from must allow job parameters for this to appear).
Expression. An expression editor dialog box appears, allowing you to specify an expression that represents the value to be used in the query.
Data flow variable. A dialog box appears offering you a choice of available data flow variables (the stage you are using the SQL Builder from must support data flow variables for this to appear).
Column. You can directly select a column from one of the tables in the table selection canvas.
Is Null/Is Not Null. Choose whether your query will match a NULL or NOT NULL condition in the column.

Join
This predicate is only available when you are building an Oracle 8i query. The Expression Editor is as follows:

Left column. Choose the column to be on the left of your join from the drop-down list.


Join type. Choose the type of join from the drop-down list.
Right column. Choose the column to be on the right of your join from the drop-down list.

Calculation/Function Expression Editor


This version of the expression editor allows you to specify an expression within a WHERE or HAVING expression, or a join condition. Expression Editor dialogs are numbered to show how deeply you are nesting them. Here is the Calculation/Function expression editor opened from within an ordinary expression editor:

Fields in the Expression Editor panel vary according to the chosen predicate as follows:

Calculation
The expression editor when you have selected the Calculation predicate is as follows:

The fields it contains are:
Left Value. Enter the argument you want on the left of your calculation. You can choose the type of argument by clicking the menu button on the right and choosing a type from the menu.
Calculation Operator. Choose the operator for your calculation from the drop-down list.
Right Value. Enter the argument you want on the right of your calculation. You can choose the type of argument by clicking the menu button on the right and choosing a type from the menu.
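For example, a calculation expression used as one side of a comparison filter might produce a WHERE clause of this general form (a sketch using the accounts table from the examples later in this chapter; the 10% uplift shown is assumed purely for illustration):

SELECT customer, balance
FROM accounts
WHERE balance * 1.1 > 1000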


Functions
The expression editor when you have selected the Functions predicate is as follows:

The fields it contains are:
Function. Choose a function from the drop-down list. The functions available depend on what stage you have invoked the SQL Builder from (i.e., which database you are building the query for).
Description. Gives a description of the function you have selected.
Parameters. Enter the parameters required by the function you have selected. The parameters that are required vary according to the selected function.

Expression Editor Menus


A button appears to the right of many of the fields in the expression editor and related dialogs. Where it appears you can click it to open a menu that allows you to specify more details about an argument being given in an expression.

Bit. Specifies that the argument is of type bit. The argument field offers a choice of 0 or 1 in a drop-down list.
Column. Specifies that the argument is a column name. The argument field offers a choice of available columns in a drop-down list.


Date. Specifies that the argument is a date. The SQL Builder enters today's date in the format expected by the database you are building the query for. You can edit this date as required or click the drop-down button and select from a calendar.
Date Time. Specifies that the argument is a date time. The SQL Builder inserts the current date and time in the format that the database the query is being built for expects. You can edit the date time as required.
Default. Allows you to select the default value of an argument (if one is defined).
Expression Editor. You can specify a function or calculation expression as an argument of an expression. Selecting this causes the Calculation/Function version of the expression editor to open.
Function. You can specify a function as an argument to an expression. Selecting this causes the Functions Form dialog box to open. The functions available depend on the database that the query you are building is intended for.
Job Parameter. You can specify that the argument is a job parameter, the value for which is supplied when you actually run the DataStage job. Selecting this opens the Parameters dialog box.
Integer. Choose this to specify that the argument is of integer type.
String. Select this to specify that the argument is of string type.
Timestamp. Specifies that the argument is a timestamp. The SQL Builder inserts the current date and time in the format that the database the query is being built for expects. You can edit the timestamp as required.

Function Form Dialog Box


This dialog box allows you to select a function for use within an expression, and specify parameters for the function.


The fields are as follows:
Function. Choose a function from the drop-down list. The functions available depend on what stage you have invoked the SQL Builder from (i.e., which database you are building the query for).
Format. Gives the format of the selected function as a guide.
Description. Gives a description of the function you have selected.
Result. Shows the actual function that will be included in the query as specified in this dialog box.
Parameters. Enter the parameters required by the function you have selected. The parameters that are required vary according to the selected function.

Parameters Dialog Box


This dialog box lists the job parameters that are currently defined for the job within which you are working. It also gives the data type of the parameter. Note that the SQL Builder does not check that the type of parameter you are inserting matches the type expected by the argument you are using it for.

Joining Tables
When you drag multiple tables onto the table selection canvas, the SQL Builder attempts to create a join between the table added and the one already on the canvas to its left. It uses captured foreign key meta data where this is available. The join is represented by a line joining the columns the SQL Builder has decided to join on (after the SQL


Builder has automatically inserted a join, you can amend it if required).

Different types of join are represented by different types of line, as follows:

The SQL Builder follows this procedure when determining what type of join to initially insert between two tables:
1 If the added table has a table to its left, make the table to the left the subject:
2 If foreign key information exists between the added table and the subject table:
3 Choose between alternatives based on the following precedence:
Relations that relate to the added table's key fields
Any other join
4 Construct an INNER JOIN between the two tables with the chosen relationship dictating the join criteria.
5 Otherwise take the subject as the next table to the left and try again from step 2.
6 Otherwise construct a join based on which are supported, chosen in the following precedence order:
INNER JOIN with no join condition (will fail validation)
CROSS JOIN (cartesian product)
7 If the added table has a table to its right, make the added table the subject and the table to the right the target:
8 If foreign key information exists between the target and the subject table:
9 Choose between alternatives based on the following precedence:
Relations that relate to the added table's key fields
Any other join
10 Construct an INNER JOIN between the two tables with the chosen relationship dictating the join criteria.
11 Otherwise take the subject as the next table to the left and try again from step 8.
12 Otherwise construct a join based on which are supported, chosen in the following precedence order:
INNER JOIN with no join condition (will fail validation)
CROSS JOIN (cartesian product)
If the join inserted by SQL Builder is not what is required, you can specify your own join.

Specifying Joins
There are three ways of altering the automatic join that the SQL Builder inserts when you add more than one table to the table selection canvas:
Using the Join Properties dialog box. Open this by selecting the link in the table selection canvas, right-clicking and choosing Properties from the shortcut menu. This dialog allows you to choose a different type of join, choose alternative conditions for the join, or choose a natural join.
Using the Alternate Relation dialog box. Open this by selecting the link in the table selection canvas, right-clicking and choosing Alternate Relation from the shortcut menu. This dialog allows you to change foreign key relationships that have been specified for the joined tables.
By dragging a column from one table to another column in any table to its right on the canvas. This replaces the existing automatic join and specifies an equijoin between the source and target column. If the join being replaced is currently specified as an inner or outer join, then the type is preserved, otherwise the new join will be an inner join.


Yet another approach is to specify the join using a WHERE clause rather than an explicit join operation (although this is not recommended where your database supports explicit join statements). In this case you would:
1 Specify the join as a cartesian product. (SQL Builder does this automatically if it cannot determine the type of join required.)
2 Specify a filter in the Selection tab filter panel. This specifies a WHERE clause that selects rows from within the cartesian product.
If you are using the SQL Builder to build an Oracle 8i query, you can use the Expression Editor to specify a join condition, which will be implemented as a WHERE statement (Oracle 8i does not support JOIN statements).
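The two styles produce logically equivalent queries. As a sketch (using the accounts and interest tables from the inner join example later in this chapter; this is illustrative rather than captured SQL Builder output), an explicit join and its WHERE-clause equivalent look like this:

SELECT accounts.Customer, interest.InterestRate
FROM accounts INNER JOIN interest
ON accounts.accountType = interest.accountType

SELECT accounts.Customer, interest.InterestRate
FROM accounts, interest
WHERE accounts.accountType = interest.accountType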

Join Properties Dialog Box


This dialog box allows you to change the type of an existing join and modify or specify the join condition.

The dialog box contains the following fields:
Cartesian Product. The cartesian product (also known as a cross join) is the result returned from two or more tables that are selected from, but not joined as such (i.e., no join condition is specified). The output is all possible rows from all the tables selected from. For example, if you selected from two tables, the database would pair every row in the first table with every row in the second table. If each table had 6 rows, the cartesian product would return 36 rows. Where the SQL Builder cannot insert a join based on available information, it will default to the cartesian product. You can explicitly specify a cartesian product by selecting the Cartesian product option in the Join Properties dialog box. The cross join icon is shown on the join.
Table join. Select the Table Join option to specify that your query will contain a join condition for the two tables being joined. The Join Condition panel is enabled, allowing you to specify further details about the join.
Expression panel. This shows the expression that the join condition will contain. You can enter or edit the expression manually or you can use the menu button to the right of the panel to specify a natural join, open the Expression Editor, or open the Alternate Relation dialog box.
Include. These fields allow you to specify that the join should be an outer join.

Select All rows from left table name to specify a left outer join.
Select All rows from right table name to specify a right outer join.
Select both All rows from left table name and All rows from right table name to specify a full outer join.

Join Icon. This tells you the type of join you have specified.
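For illustration (a sketch based on the accounts and interest tables used in the examples later in this chapter, not captured SQL Builder output), the outer join options correspond to SQL of this general form:

SELECT accounts.Customer, interest.InterestRate
FROM accounts LEFT OUTER JOIN interest
ON accounts.accountType = interest.accountType

SELECT accounts.Customer, interest.InterestRate
FROM accounts FULL OUTER JOIN interest
ON accounts.accountType = interest.accountType

The first statement returns all rows from the left table (a left outer join); the second, with both options selected, returns all rows from both tables (a full outer join).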

Alternate Relation Dialog Box


This dialog box displays all the foreign key relationships that have been defined between the target table and other tables that appear to the left of it in the table selection canvas. You can select the relationship that you want to appear as the join in your query by selecting it so that it appears in the lower window, and clicking OK.


Properties Dialogs
Depending where you are in the SQL Builder, choosing Properties from the shortcut menu opens a dialog box as follows:
The Table Properties dialog box opens when you select a table in the table selection canvas and choose Properties from the shortcut menu.
The SQL Properties dialog box opens when you select the Properties icon in the toolbox or Properties from the table selection canvas background.
The Join Properties dialog box opens when you select a join in the table selection canvas and choose Properties from the shortcut menu. This dialog is described on page 59-26.

Table Properties Dialog Box


The Table Properties dialog box is as follows:

It contains the following fields:
Table name. The name of the table whose properties you are viewing. You can click the menu button and choose Job Parameter to open the Parameters dialog box (see page 59-23). This allows you to specify a job parameter to replace the table name if required, but note that the SQL Builder will always refer to this table using its alias.
Alias. The alias that the SQL Builder uses to refer to this table. You can edit the alias if required. If the table alias is used in the selection grid or filters, changing the alias in this dialog box will update the alias there.


SQL Properties Dialog Box


This dialog box gives you details about the syntax of the SQL that the SQL Builder is currently building. The syntax depends on the type of stage that you invoked the builder from.

It contains the following fields:
Syntax. This panel gives the name, version, and description of the SQL syntax that the SQL Builder is currently building. This depends on the stage that you have invoked the SQL Builder from.
SQL. This panel allows you to select whether the SQL Builder supports the DISTINCT qualifier. This is normally filled in by default according to the syntax in operation.

Example Queries
This section gives some examples of how different queries are built with the SQL Builder to illustrate its use. It includes examples of the following types of query:
A simple select query
Selecting from two tables with an inner join based on equality
A query performing an aggregation

Example Simple Select Query


This query selects the columns Customer, AccountNo, AccountType, and balance from the table accounts. Only those rows where the


balance exceeds $1,000 will be returned. The results will be in alphabetical order by customer. First, we need to find the table definition for the table accounts, which has previously been imported into the DataStage repository:

Next, we drag the highlighted table to the table selection canvas:

Then we need to define the columns that the query will select from this table. We do this by selecting from the Column Expression


drop-down list in the column selection grid. We also specify that the results will appear in alphabetical order by customer name.

The last step is to specify that only rows containing a balance column of more than 1000 should be returned. This is done in the Filter expression panel, first by selecting the comparison predicate, then by specifying the comparison in the expression editor and clicking the Add button:

The completed query can be viewed on the Sql tab:
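Based on the selections described above, the completed statement would take roughly the following form (a sketch rather than captured SQL Builder output, so the exact column case, quoting, and clause layout may differ from what the SQL Builder generates):

SELECT Customer, AccountNo, AccountType, balance
FROM accounts
WHERE balance > 1000
ORDER BY Customer ASC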


Example Inner Join


This query selects the columns Customer, accountType and balance from the table accounts, and the column InterestRate from the table interest. Both tables have a column called accountType, and the tables will be joined on rows where the accountType contains identical values. Only rows that have matching accountType columns will be returned. For this query, both the accounts table and the interest table are dragged from the repository window to the table selection canvas. Because the table definitions have no foreign key information defined, the SQL Builder inserts a cross-join (i.e., the query would return the cartesian product of the two tables if it was run at this point).

The next step is to define the actual join required. The Join Properties dialog box is opened by selecting the link, right-clicking and choosing Properties from the shortcut menu. The Table Join option is selected, and the Expression Editor opened by clicking the menu button and selecting Expression Editor from the menu. The comparison predicate is chosen, and the expression editor used to define a test of equality as follows:
1 Choose accounts.accountType from the column drop-down list.
2 Choose = from the operator drop-down list.
3 Click the menu button next to the comparison value field and choose a type of Column from the menu that appears.
4 Choose interest.accountType from the drop-down list of columns that appears in the comparison value field.


5 Click OK.

The join expression appears in the Join Condition window:

The columns to be returned by the query are defined in the column selection grid:

The finished query can be viewed on the Sql tab:
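Based on the join and column selections described above, the finished statement would take roughly this form (a sketch; depending on the SQL syntax in use, the join may be generated as an explicit INNER JOIN, as shown here, or as a WHERE-clause join):

SELECT accounts.Customer, accounts.accountType, accounts.balance,
       interest.InterestRate
FROM accounts INNER JOIN interest
ON accounts.accountType = interest.accountType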


Example Aggregate Query


This query returns the average account balance for different types of account. The results are grouped by account type, and only account types where the average balance is greater than $1000 are asked for. The whole query can be defined on the Group tab. First, the accounts table is dragged to the table selection canvas from the repository window.

Next, the columns to be included in the query are defined in the grouping grid:

The average aggregation is performed on the balance column and the results are grouped by the accountType column. Next the HAVING clause is specified in the filter panel to limit the groups displayed to those with an average greater than 1000. Because we want to test the average of the balance column in the expression, we have to open the expression editor from the column drop-down list in the Filter expression panel. The expression editor allows you to use calculations or functions in place of a column name. In this case we define that we want the average of the balance column.


The expression then appears in the filter panel:

And, when Add is clicked, the final HAVING expression appears in the filter expression panel:

Finally, the completed query can be viewed in the Sql tab:
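Based on the grouping and HAVING filter described above, the completed statement would take roughly this form (a sketch; the alias follows the alias_n convention mentioned earlier, and the exact layout the SQL Builder generates may differ):

SELECT accountType, AVG(balance) AS alias_1
FROM accounts
GROUP BY accountType
HAVING AVG(balance) > 1000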


60
Remote Deployment
Remote deployment of parallel jobs allows job scripts to be stored and run on a separate machine from the DataStage Server machine. The remote deployment option can, for example, be used to run jobs on a computer grid.
Any remote system that has a job so deployed must have access to the Parallel Engine in order to run the job (see the section "Copying the Parallel Engine to Your System Nodes" in DataStage Install and Upgrade Guide). Such systems must also have the correct runtime libraries for that platform type (see "Deployment Systems" in Install and Upgrade Guide).
Because these jobs are not run on the DataStage Server, server components (such as BASIC Transformer stages, server shared containers, before and after subroutines, and job control routines) cannot be used. There is also a limited set of plug-in stages available for use in these jobs. When you run the jobs, the logging, monitoring, and operational meta data collection facilities provided by DataStage are not available. Deployed jobs do output logging information in internal parallel engine format, but provision for collecting this is the user's responsibility.
You develop a Parallel job for deployment using the DataStage Designer, and then compile it. A deployment package is automatically produced at compilation. Such jobs can also be run under the control of the DataStage Server (using Designer or Director clients, or the dsjob command) as per normal. (Note that running jobs in the normal way runs the executables in the project directory, not the deployment scripts.)


It is your responsibility to define a configuration file on the remote machine, transfer the deployment package to the remote computer and to run the job. The following diagram gives a conceptual view of an example deployment system. In this example, deployable jobs are transferred to three conductor node machines. Each conductor node has a configuration file describing the resources that it has available for running the jobs. The jobs then run under the control of that conductor:

[Diagram: two design-time DataStage Servers produce deployable scripts, which are transferred to three conductor nodes in a grid; each conductor node controls its own set of processing nodes.]

Note The DataStage Server system and the Node systems must be running the same operating system.

Enabling a Project for Job Deployment


Projects are made capable of deploying jobs in this way from the DataStage Administrator client. To make the jobs in a project deployable:


1  Start the DataStage Administrator client.

2  Go to the Projects tab and select the project whose parallel jobs you want to make deployable from the list.

3  Click the Properties button to open the Project Properties dialog box.

4  Go to the Remote tab.

5  In the Base directory name field, provide a home directory location for deployment; in this directory there will be one directory for each project. This location has to be accessible from the server machine, but does not have to be a disk local to that machine. Providing a location here enables the job deployment features.

6  In the Deployed job directory template field, optionally specify an alternative name for the deployment directory associated with a particular job. This field is used in conjunction with Base directory name in the form base_name/project_name/job_directory. By default, if nothing is specified, the name corresponds to the internal script directory used on the DataStage server project directory, RT_SCjobnum, where jobnum is the internal job number allocated to the job. Substitution strings provided are:

   %j  jobname
   %d  internal number

   The simplest case is just %j (use the job name). A prefix can be used, for example job_%j. The default corresponds to RT_SC%d.

7  In the Custom deployment commands field, optionally specify further actions to be carried out at the end of a deployment compile. You can specify Unix programs and/or calls to user shell scripts as required. The actions take place in the deployment directory for the job. This field uses the same substitution strings as the directory template. For example:

   tar cvf ../%j.tar * ; compress ../%j.tar

   will create a compressed tar archive of the deployed job, named after the job.

Note  If either, or both, of the USS check boxes on this tab are selected, then USS deployment will be implemented (see Chapter 56, "Parallel Jobs on USS"). You should not check these boxes.

Deployment Package
When you compile a job in the DataStage Designer with project deployment enabled, the following files are produced:

   Command shell script
   Environment variable setting source script
   Main Parallel (osh) program script
   Script parameter file
   XML report file (if enabled; see "Enabling/Disabling Generation of XML Report" in Parallel Job Advanced User Guide)
   Compiled transformer binary files (if the job contains any Transformer stages)
   Transformer source compilation scripts

These are the files that will be copied to a job directory in the base directory specified in the Administrator client for the project. By default the job directory is called RT_SCjobnum, where jobnum is the internal job number allocated to the job (you can change the form of the name in the Administrator client). If you have additional custom components designed outside the job (for example, custom, built, or wrapped stages) you should ensure


that these are available when the job is run. They are not automatically packaged and deployed.

Command Shell Script pxrun.sh


The command shell script sources the environment variable script, then calls the PXEngine, specifying the main osh program script and script parameter file as parameters. Run this script to run your job.

Environment Variable Setting Source Script evdepfile


This file contains the environment variables for a deployed job when it is run. It is based on the environment variables set when the job was compiled. It is possible to edit this file manually if required before running a job. The file can be removed altogether, but it is then your responsibility to set up the environment before running the job.

Main Parallel (OSH) Program Script OshScript.osh


The main parallel job script. You must execute the command shell script in order to run this; you should not run it directly.

Script Parameter File jpdepfile


This is used by pxrun.sh. It contains the job parameters for a deployed job when it is run. It is based on the default job parameters when the job was compiled. It is possible to edit this file manually if required before running a job.

XML Report File <jobname>.xml


An XML report of the job design can be automatically generated at compile time (if enabled using an administration command; see "Enabling/Disabling Generation of XML Report" in Parallel Job Advanced User Guide), and the report is included in the job deployment package. For more information on HTML and XML job reports, see "Job Reports" in DataStage Designer Guide.

Compiled Transformer Binary Files <jobnamestagename>.trx.so


There is one of these for each Transformer stage in your job.


Self-Contained Transformer Compilation


In order to make the job self-contained with regard to transformer compilation, there are the following additional files which can optionally be used for transformer recompilation; none are present if there are no transformers in the job:

   Transformer source files (internal transformer language). Each has a name in the form <internalidentifier>_<jobname>_<stagename>.trx. There is one such file for each Transformer stage in the job.

   Shell scripts to run transformer operator compile jobs. Each has a name in the form <internalidentifier>_<jobname>_<stagename>.trx.sh. There is one such file for each Transformer stage in the job.

   Transformer compilation operator osh scripts. Each is a Parallel job script that compiles the corresponding Transformer stage, and is called from the corresponding .sh file. It has a name in the form <internalidentifier>_<jobname>_<stagename>.trx.osh. There is one such file for each Transformer stage in the job.

   One master shell script, pxcompile.sh, which calls all the transformer compile scripts.

If you want to recompile transformers on your deployment platform before running the job, you should run pxcompile.sh.

Deploying a Job
This describes how to design a job on the DataStage system in a remote deployment project, transfer it to the deployment machine, and run it.
1  In the DataStage Administrator, specify a remote deployment project as described in "Enabling a Project for Job Deployment" on page 60-2.

2  Define a configuration file on your remote deployment systems that describes them. Use the environment variable APT_CONFIG_FILE to identify it on the remote machine (a minimal example configuration file is sketched after this procedure). You can do this in one of three ways:

   If you are always going to use the same configuration file on the same remote system, define APT_CONFIG_FILE on a project-wide basis in the DataStage Administrator. All your remote deployment job packages will have that value for APT_CONFIG_FILE.

   To specify the value at individual job level, specify APT_CONFIG_FILE as a job parameter and set the default value to the location of the configuration file. This will be packaged with that particular job.

   To specify the value at run time, set the value of APT_CONFIG_FILE to $ENV in the DataStage Administrator and then define APT_CONFIG_FILE as an environment variable on your remote machine. The job will pick up the value at run time.

3  In the DataStage Designer, design your parallel job as normal (but remember that you cannot use BASIC Transformer stages, shared containers, or plug-in stages in remote deployment jobs). When you are happy with your job design, compile it.

4  If your job contains Transformer stages, you can, if required, recompile the transformers on the deployment machine. To do this, execute the following file:

   pxcompile.sh

5  When your Transformer stages have successfully compiled, run the job by executing the following file:

   pxrun.sh
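As an illustration only (the host names, directory paths, user name, and job number below are invented, and scp is just one of many ways to move files), a minimal parallel engine configuration file for a remote system, and the commands used to transfer and run a deployment package, might look like this:

   {
      node "node1"
      {
         fastname "conductor1"
         pools ""
         resource disk "/data/datasets" {pools ""}
         resource scratchdisk "/data/scratch" {pools ""}
      }
      node "node2"
      {
         fastname "compute1"
         pools ""
         resource disk "/data/datasets" {pools ""}
         resource scratchdisk "/data/scratch" {pools ""}
      }
   }

   # on the DataStage Server: copy the deployment package for one job
   scp -r /u1/deploy/myproject/RT_SC84 dsadm@conductor1:/jobs/myproject/

   # on the remote machine: identify the configuration file, then run the job
   export APT_CONFIG_FILE=/jobs/config/remote.apt
   cd /jobs/myproject/RT_SC84
   ./pxrun.sh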

Server Side Plug-Ins


DataStage XML and Java plug-ins operate on remote nodes. The following directories are required on the nodes in order to run a plug-in; these can be copied from a DataStage Server installation:

   DSEngine/java
   DSEngine/lib
   DSCAPIOp
Note The Java plug-in does not run on Red Hat Enterprise Linux AS 2.1/Red Hat 7.3.


A
Schemas
Schemas are an alternative way for you to specify column definitions for the data used by parallel jobs. By default, most parallel job stages take their meta data from the Columns tab, which contains table definitions, supplemented, where necessary, by format information from the Format tab. For some stages, you can specify a property that causes the stage to take its meta data from the specified schema file instead. Some stages also allow you to specify a partial schema. This allows you to describe only those columns that a particular stage is processing and ignore the rest. The schema file is a plain text file; this appendix describes its format. A partial schema has the same format.
Note If you are using a schema file on an NLS system, the schema file needs to be in UTF-8 format. It is, however, easy to convert text files between two different maps with a DataStage job. Such a job would read data from a text file using a Sequential File stage and specifying the appropriate character set on the NLS Map page. It would write the data to another file using a Sequential File stage, specifying the UTF-8 map on the NLS Map page.

Schema Format
A schema contains a record (or row) definition. This describes each column (or field) that will be encountered within the record, giving column name and data type. The following is an example record schema:


record (
   name:string[255];
   address:nullable string[255];
   value1:int32;
   value2:int32;
   date:date)

(The line breaks are there for ease of reading; you would omit these if you were defining a partial schema. For example, record(name:string[255];value1:int32;date:date) is a valid schema.) The format of each line describing a column is:
column_name:[nullability]datatype;

column_name. This is the name that identifies the column. Names must start with a letter or an underscore (_), and can contain only alphanumeric or underscore characters. The name is not case sensitive. The name can be of any length.

nullability. You can optionally specify whether a column is allowed to contain a null value, or whether this would be viewed as invalid. If the column can be null, insert the word nullable. By default columns are not nullable. You can also include nullable at record level to specify that all columns are nullable, then override the setting for individual columns by specifying not nullable. For example:
record nullable ( name:not nullable string[255]; value1:int32; date:date)

datatype. This is the data type of the column. This uses the internal data types as described on page 2-28, not SQL data types as used on Columns tabs in stage editors. You can include comments in schema definition files. A comment is started by a double slash //, and ended by a newline.


The example schema corresponds to the following table definition as specified on a Columns tab of a stage editor:

The following sections give special consideration for representing various data types in a schema file.

Date Columns
The following examples show various different data definitions:
record (dateField1:date; )           // single date
record (dateField2[10]:date; )       // 10-element date vector
record (dateField3[]:date; )         // variable-length date vector
record (dateField4:nullable date;)   // nullable date

(See "Complex Data Types" on page 2-32 for information about vectors.)

Decimal Columns
To define a record field with data type decimal, you must specify the column's precision, and you may optionally specify its scale, as follows:
column_name:decimal[ precision, scale];

where precision is greater than or equal to 1 and scale is greater than or equal to 0 and less than precision. If the scale is not specified, it defaults to zero, indicating an integer value. The following examples show different decimal column definitions:


record (dField1:decimal[12]; )            // 12-digit integer
record (dField2[10]:decimal[15,3]; )      // 10-element decimal vector
record (dField3:nullable decimal[15,3];)  // nullable decimal

Floating-Point Columns
To define floating-point fields, you use the sfloat (single-precision) or dfloat (double-precision) data type, as in the following examples:
record (aSingle:sfloat; aDouble:dfloat; ) // float definitions
record (aSingle: nullable sfloat;)        // nullable sfloat
record (doubles[5]:dfloat;)               // fixed-length vector of dfloats
record (singles[]:sfloat;)                // variable-length vector of sfloats

Integer Columns
To define integer fields, you use an 8-, 16-, 32-, or 64-bit integer data type (signed or unsigned), as shown in the following examples:
record (n:int32;)           // 32-bit signed integer
record (n:nullable int64;)  // nullable, 64-bit signed integer
record (n[10]:int16;)       // fixed-length vector of 16-bit signed integers
record (n[]:uint8;)         // variable-length vector of 8-bit unsigned ints

Raw Columns
You can define a record field that is a collection of untyped bytes, of fixed or variable length. You give the field data type raw. The definition for a raw field is similar to that of a string field, as shown in the following examples:
record (var1:raw[];)       // variable-length raw field
record (var2:raw;)         // variable-length raw field; same as raw[]
record (var3:raw[40];)     // fixed-length raw field
record (var4[5]:raw[40];)  // fixed-length vector of raw fields

You can specify the maximum number of bytes allowed in the raw field with the optional property max, as shown in the example below:
record (var7:raw[max=80];)

The length of a fixed-length raw field must be at least 1.

String Columns
You can define string fields of fixed or variable length. For variable-length strings, the string length is stored as part of the string as a hidden integer. The storage used to hold the string length is not included in the length of the string.


The following examples show string field definitions:


record (var1:string[];)             // variable-length string
record (var2:string;)               // variable-length string; same as string[]
record (var3:string[80];)           // fixed-length string of 80 bytes
record (var4:nullable string[80];)  // nullable string
record (var5[10]:string;)           // fixed-length vector of strings
record (var6[]:string[80];)         // variable-length vector of strings

You can specify the maximum length of a string with the optional property max, as shown in the example below:
record (var7:string[max=80];)

The length of a fixed-length string must be at least 1.

Time Columns
By default, the smallest unit of measure for a time value is seconds, but you can instead use microseconds with the [microseconds] option. The following are examples of time field definitions:
record (tField1:time; )               // single time field in seconds
record (tField2:time[microseconds];)  // time field in microseconds
record (tField3[]:time; )             // variable-length time vector
record (tField4:nullable time;)       // nullable time

Timestamp Columns
Timestamp fields contain both time and date information. In the time portion, you can use seconds (the default) or microseconds for the smallest unit of measure. For example:
record (tsField1:timestamp;)                // single timestamp field in seconds
record (tsField2:timestamp[microseconds];)  // timestamp in microseconds
record (tsField3[15]:timestamp; )           // fixed-length timestamp vector
record (tsField4:nullable timestamp;)       // nullable timestamp

Vectors
Many of the previous examples show how to define a vector of a particular data type. You define a vector field by following the column name with brackets []. For a variable-length vector, you leave the brackets empty, and for a fixed-length vector you put the number of vector elements in the brackets. For example, to define a variable-length vector of int32, you would use a field definition such as the following one:
intVec[]:int32;

To define a fixed-length vector of 10 elements of type sfloat, you would use a definition such as:
sfloatVec[10]:sfloat;


You can define a vector of any data type, including string and raw. You cannot define a vector of a vector or tagged type. You can, however, define a vector of type subrecord, and you can define that subrecord includes a tagged column or a vector. You can make vector elements nullable, as shown in the following record definition:
record (vInt[]:nullable int32; vDate[6]:nullable date; )

In the example above, every element of the variable-length vector vInt will be nullable, as will every element of fixed-length vector vDate. To test whether a vector of nullable elements contains no data, you must check each element for null.

Subrecords
Record schemas let you define nested field definitions, or subrecords, by specifying the type subrec. A subrecord itself does not define any storage; instead, the fields of the subrecord define storage. The fields in a subrecord can be of any data type, including tagged. The following example defines a record that contains a subrecord:
record ( intField:int16; aSubrec:subrec ( aField:int16; bField:sfloat; ); )

In this example, the record contains a 16-bit integer field, intField, and a subrecord field, aSubrec. The subrecord includes two fields: a 16-bit integer and a single-precision float. Subrecord columns of value data types (including string and raw) can be nullable, and subrecord columns of subrec or vector types can have nullable elements. A subrecord itself cannot be nullable. You can define vectors (fixed-length or variable-length) of subrecords. The following example shows a definition of a fixed-length vector of subrecords:
record (aSubrec[10]:subrec ( aField:int16; bField:sfloat; ); )

You can also nest subrecords and vectors of subrecords, to any depth of nesting. The following example defines a fixed-length vector of subrecords, aSubrec, that contains a nested variable-length vector of subrecords, cSubrec:
record (aSubrec[10]:subrec (
   aField:int16;
   bField:sfloat;
   cSubrec[]:subrec (
      cAField:uint8;
      cBField:dfloat;
   );
); )

Subrecords can include tagged aggregate fields, as shown in the following sample definition:
record (aSubrec:subrec ( aField:string; bField:int32; cField:tagged ( dField:int16; eField:sfloat; ); ); )

In this example, aSubrec has a string field, an int32 field, and a tagged aggregate field. The tagged aggregate field cField can have either of two data types, int16 or sfloat.

Tagged Columns
You can use schemas to define tagged columns (similar to C unions), with the data type tagged. Defining a record with a tagged type allows each record of a data set to have a different data type for the tagged column. When your application writes to a field in a tagged column, DataStage updates the tag, which identifies it as having the type of the column that is referenced. The data type of a tagged column can be of any data type except tagged or subrec. For example, the following record defines a tagged subrecord field:
record ( tagField:tagged ( aField:string; bField:int32; cField:sfloat; ) ; )

In the example above, the data type of tagField can be one of following: a variable-length string, an int32, or an sfloat.

Partial Schemas
Some parallel job stages allow you to use a partial schema. This means that you need only define column definitions for those columns that you are actually going to operate on. The stages that allow you to do this are file stages that have a Format tab. These are:


   Sequential File stage
   File Set stage
   External Source stage
   External Target stage
   Column Import stage

You specify a partial schema using the Intact property on the Format tab of the stage together with the Schema File property on the corresponding Properties tab. To use this facility, you need to turn Runtime Column Propagation on, and provide enough information about the columns being passed through to enable DataStage to skip over them as necessary.

In the file defining the partial schema, you need to describe the record and the individual columns. Describe the record as follows:

   intact. This property specifies that the schema being defined is a partial one. You can optionally specify a name for the intact schema here as well, which you can then reference from the Intact property of the Format tab.

   record_length. The length of the record, including record delimiter characters.

   record_delim_string. String giving the record delimiter as an ASCII string in single quotes. (For a single character delimiter, use record_delim and supply a single ASCII character in single quotes.)

Describe the columns as follows:

   position. The position of the starting character within the record.

   delim. The column trailing delimiter, which can be any of the following:

      ws to skip all standard whitespace characters (space, tab, and newline) trailing after a field.
      end to specify that the last field in the record is composed of all remaining bytes until the end of the record.
      none to specify that fields have no delimiter.
      null to specify that the delimiter is the ASCII null character.
      ASCII_char specifies a single ASCII delimiter. Enclose ASCII_char in single quotation marks. (To specify multiple ASCII characters, use delim_string followed by the string in single quotes.)

   text specifies the data representation type of a field as being text rather than binary. Data is formatted as text by default. (Specify binary if data is binary.)


Columns that are being passed through intact only need to be described in enough detail to allow DataStage to skip them and locate the columns that are to be operated on. For example, say you have a sequential file defining rows comprising six fixed width columns, and you are interested in the last two. You know that the first four columns together contain 80 characters. Your partial schema definition might appear as follows:
record { intact=details, record_delim_string='\r\n' }
   ( colstoignore: string[80];
     name: string[20] { delim=none };
     income: uint32 { delim=',', text };
   )

Your stage would not be able to alter anything in a row other than the name and income columns (it could also add a new column to either the beginning or the end of a row).


B
Functions
This appendix describes the functions that are available from the expression editor under the Function menu item. You would typically use these functions when defining a column derivation in a Transformer stage. The functions are described by category. This set includes functions that take string arguments or return string values. If you have NLS enabled, the argument strings or returned strings can be strings or ustrings. The same function is used for either string type. The only exceptions are the functions StringToUstring () and UstringToString ().

Date and Time Functions


The following table lists the functions available in the Date & Time category (Square brackets indicate an argument is optional):
Name
DateFromDaysSince

Description
Returns a date by adding an integer to a baseline date Returns a date from the given julian date Returns the number of days from source date to the given date Returns the hour portion of a time

Arguments
number (int32) [baseline date] juliandate (uint32) source_date given_date time

Output
date

DateFromJulianDay DaysSinceFromDate

date days since (int32)

HoursFromTime

hours (int8)


Name
JulianDayFromDate MicroSecondsFromTime

Description
Returns julian day from the given date Returns the microsecond portion from a time Returns the minute portion from a time Returns the day of the month given the date Returns the month number given the date Returns the date of the specified day of the week soonest after the source date Returns the date of the specified day of the week most recent before the source date Returns the second portion from a time Returns the number of seconds between two timestamps Returns the system time and date as a formatted string Returns the time given the number of seconds since midnight Returns a timestamp form the given date and time Returns the timestamp from the number of seconds from the base timestamp Returns a timestamp from the given unix time_t value Returns a unix time_t value from the given timestamp

Arguments
date time

Output
julian date (int32) microseconds (int32) minutes (int8) day (int8) month number (int8) date

MinutesFromTime MonthDayFromDate MonthFromDate NextWeekdayFromDate

time date date source date day of week (string)

PreviousWeekdayFromDate

source date day of week (string)

date

SecondsFromTime SecondsSinceFromTimestamp

time timestamp base timestamp -

seconds (dfloat) seconds (dfloat)

TimeDate

system time and date (string) time

TimeFromMidnightSeconds

seconds (dfloat)

TimestampFromDateTime

date time seconds (dfloat) [base timestamp]

timestamp

TimestampFromSecondsSince

timestamp

TimestampFromTimet

timet (int32)

timestamp

TimetFromTimestamp

timestamp

timet (int32)


Name
WeekdayFromDate

Description
Returns the day number of the week from the given date. Origin day optionally specifies the day regarded as the first in the week and is Sunday by default Returns the day number in the year from the given date Returns the year from the given date Returns the week number in the year from the given date

Arguments
date [origin day]

Output
day (int8)

YeardayFromDate

date

day (int16)

YearFromDate YearweekFromDate

date date

year (int16) week (int16)

Date, Time, and Timestamp functions that specify dates, times, or timestamps in the argument use strings with specific formats:

   For a date, the format is %yyyy-%mm-%dd.

   For a time, the format is %hh:%nn:%ss, or, if extended to include microseconds, %hh:%nn:%ss.x, where x gives the number of decimal places seconds is given to.

   For a timestamp, the format is %yyyy-%mm-%dd %hh:%nn:%ss, or, if extended to include microseconds, %yyyy-%mm-%dd %hh:%nn:%ss.x, where x gives the number of decimal places seconds is given to.

This applies to the arguments date, baseline date, given date, time, timestamp, and base timestamp. Functions that have days of week in the argument take a string specifying the day of the week; this applies to day of week and origin day.
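For illustration only (the input link and column names are invented), derivations using these functions and formats might look like this:

   DateFromDaysSince(30, "2004-12-01")
   DaysSinceFromDate("2004-01-01", InLink.order_date)
   HoursFromTime(InLink.pickup_time)
   TimestampFromDateTime(InLink.ship_date, "18:30:00")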


Logical Functions
The following table lists the functions available in the Logical category (square brackets indicate an argument is optional):
Name
Not

Description
Returns the complement of the logical value of an expression Returns the bitwise AND of the two integer arguments Returns the bitwise OR of the two integer arguments Returns the bitwise Exclusive OR of the two integer arguments Returns a string containing the binary representation in "1"s and "0"s of the given integer Returns the integer made from the string argument, which contains a binary representation of "1"s and "0"s. Returns an integer with specific bits set to a specific state, where origfield is the input value to perform the action on, bitlist is a string containing a list of comma separated bit numbers to set the state of, and bitstate is either 1 or 0, indicating which state to set those bits.

Arguments
expression

Output
Complement (int8)

BitAnd

number 1 (uint64) number 2 (uint64) number 1 (uint64) number 2 (uint64)

number (uint64)

BitOr

number (uint64)

BitXOr

number 1 (uint64) number 2 (uint64)

number (uint64)

BitExpand

number (uint64)

string

BitCompress

number (string)

number (uint64)

SetBit

origfield (uint64) bitlist (string) bitstate (uint8)

number (uint64)
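For illustration only (InLink.flags is an invented column of an unsigned integer type), the bit-manipulation functions are used like this:

   BitAnd(InLink.flags, 12)
   SetBit(InLink.flags, "2,4", 1)
   BitExpand(InLink.flags)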


Mathematical Functions
The following table lists the functions available in the Mathematical category (square brackets indicate an argument is optional):
Name
Abs Acos

Description
Absolute value of any numeric expression Calculates the trigonometric arc-cosine of an expression Calculates the trigonometric arc-sine of an expression Calculates the trigonometric arc-tangent of an expression Calculates the smallest integer value greater than or equal to the given decimal value Calculates the trigonometric cosine of an expression Calculates the hyperbolic cosine of an expression Outputs the whole part of the real division of two real numbers (dividend, divisor) Calculates the result of base 'e' raised to the power designated by the value of the expression Calculates the absolute value of the given value Calculates the largest integer value less than or equal to the given decimal value Calculates a number from an exponent and mantissa Returns the absolute value of the given integer

Arguments
number (int32) number (dfloat)

Output
result (dfloat) result (dfloat)

Asin

number (dfloat)

result (dfloat)

Atan

number (dfloat)

result (dfloat)

Ceil

number (decimal)

result (int32)

Cos

number (dfloat)

result (dfloat)

Cosh Div

number (dfloat) dividend (dfloat) divisor (dfloat)

result (dfloat) result (dfloat)

Exp

number (dfloat)

result (dfloat)

Fabs Floor

number (dfloat) number (decimal)

result (dfloat) result (int32)

Ldexp

mantissa (dfloat) exponent (int32) number (uint64)

result (dfloat)

Llabs

result (int64)


Name
Ln

Description
Calculates the natural logarithm of an expression in base 'e' Returns the log to the base 10 of the given value Returns the greater of the two argument values Returns the lower of the two argument values Calculates the modulo (the remainder) of two expressions (dividend, divisor) Negate a number Calculates the value of an expression when raised to a specified power (expression, power) Return a psuedo random integer between 0 and 2321 Returns a random number between 0 232-1 Calculates the trigonometric sine of an angle Calculates the hyperbolic sine of an expression Calculates the square root of a number Calculates the trigonometric tangent of an angle Calculates the hyperbolic tangent of an expression

Arguments
number (dfloat)

Output
result (dfloat)

Log10 Max Min Mod

number (dfloat) number 1 (int32) number 2(int32) number 1 (int32) number 2 (int32) dividend (int32) divisor (int32)

result (dfloat) result (int32) result (int32) result (int32)

Neg Pwr

number (dfloat) expression (dfloat) power (dfloat)

result (dfloat) result (dfloat)

Rand

result (uint32)

Random

result (uint32)

Sin

number (dfloat)

result (dfloat)

Sinh Sqrt Tan

number (dfloat) number (dfloat) number (dfloat)

result (dfloat) result (dfloat) result (dfloat)

Tanh

number (dfloat)

result (dfloat)


Null Handling Functions


The following table lists the functions available in the Null Handling category (square brackets indicate an argument is optional):
Name
IsNotNull

Description
Returns true when an expression does not evaluate to the null value Returns true when an expression evaluates to the null value Change an in-band null to out of band null Returns an empty string if input column is null, otherwise returns the input column value Returns zero if input column is null, otherwise returns the input column value Returns specified value if input column is null, otherwise returns the input column value Assign a null value to the target column

Arguments
any

Output
true/false (int8)

IsNull

any

true/false (int8)

MakeNull

any (column) string (string) input column

NullToEmpty

input column value or empty string input column value or zero

NullToZero

input column

NullToValue

input column, value

input column value or value

SetNull

true = 1 false = 0
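For illustration only (the column names are invented), typical derivations using the null handling functions are:

   NullToValue(InLink.credit_limit, 5000)
   NullToZero(InLink.discount)
   IsNotNull(InLink.middle_name)

The first two substitute a default value when the input column is null; the third returns 1 or 0 and is typically used inside a condition or constraint.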

Number Functions
The following table lists the functions available in the Number category (square brackets indicate an argument is optional):
Name
MantissaFromDecimal

Description
Returns the mantissa from the given decimal

Arguments
number (decimal)

Output
result (dfloat)


Name
MantissaFromDFloat

Description
Returns the mantissa from the given dfloat

Arguments
number (dfloat)

Output
result (dfloat)

Raw Functions
The following table lists the functions available in the Raw category (square brackets indicate an argument is optional):
Name
RawLength

Description
Returns the length of a raw string

Arguments
input string (raw)

Output
Result (int32)

String Functions
The following table lists the functions available in the String category (square brackets indicate an argument is optional):
Name
AlNum

Description
Return whether the given string consists of alphanumeric characters Returns 1 if string is purely alphabetic Return the string after reducing all consective whitespace to a single space Compares two strings for sorting

Arguments
string (string)

Output
true/false (int8)

Alpha CompactWhiteSpace

string (string) string (string)

result (int8) result (string)

Compare

string1 (string) string2 (string) [justification (L or R)]

result (int8)

ComparNoCase ComparNum

Case insensitive comparison of two strings Compare the first n characters of the two strings

string1 (string) string2 (string) string1 (string) string2 (string) length (int16)

result (int8) result (int8)

CompareNumNoCase Caseless comparison of the first n characters of the two strings

string1 (string) string2 (string) length (int16)

result (int8)


Name
Convert

Description
Converts specified characters in a string to designated replacement characters Count number of times a substring occurs in a string Count number of delimited fields in a string Change all uppercase letters in a string to lowercase Enclose a string in double quotation marks Return 1 or more delimited substrings

Arguments
fromlist (string) tolist (string) expression (string) string (string) substring (string) string (string) delimiter (string) string (string) string (string) string (string) delimiter (string) occurrence (int32) [number (int32)]

Output
result (string)

Count

result (int32)

Dcount

result (int32)

DownCase DQuote Field

result (string) result (string) result (string)

Index

Find starting character position of substring Leftmost n characters of string

string (string) substring (string) occurrence (int32) string (string) number (int32)

result (int32)

Left

result (string)

Len Num PadString

Length of string in characters Return 1 if string can be converted to a number Return the string padded with the optional pad character and optional length Rightmost n characters of string

string (string) string (string) string (string) padlength (int32) string (string) number (int32)

result (int32) result (int8) result (string)

Right

result (string)

Soundex

Returns a string which identifies a set of words that are (roughly) phonetically alike based on the standard, open algorithm for SOUNDEX evaluation Return a string of N space characters Enclose a string in single quotation marks Repeat a string

string (string)

result (string)

Space Squote Str

length (int32) string (string) string (string) repeats (int32)

result (string) result (string) result (string)


Name
StripWhiteSpace Trim

Description
Return the string after stripping all whitespace from it Remove all leading and trailing spaces and tabs plus reduce internal occurrences to one

Arguments
string (string) string (string) [stripchar (string)] [options (string)]

Output
result (string) result (string)

TrimB TrimF Trim Leading Trailing Upcase

Remove all trailing spaces and tabs Remove all leading spaces and tabs Returns a string with leading and trailing whitespace removed Change all lowercase letters in a string to uppercase

string (string) string (string) string (string)

result (string) result (string) result (string)

string (string)

result (string)

true = 1, false = 0.

Possible options for the Trim function are:

   L  Removes leading occurrences of character.
   T  Removes trailing occurrences of character.
   B  Removes leading and trailing occurrences of character.
   R  Removes leading and trailing occurrences of character, and reduces multiple occurrences to a single occurrence.
   A  Removes all occurrences of character.
   F  Removes leading spaces and tabs.
   E  Removes trailing spaces and tabs.
   D  Removes leading and trailing spaces and tabs, and reduces multiple spaces and tabs to single ones.
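For illustration only (the column names are invented):

   Trim(InLink.account_code, "0", "L")
   Field(InLink.address, ",", 2)
   DownCase(CompactWhiteSpace(InLink.description))

The first strips leading zeros, the second returns the second comma-delimited substring, and the third reduces runs of whitespace to single spaces and converts the result to lowercase.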


Vector Function
The following function can be used within expressions to access an element in a vector column. The vector index starts at 0.
Name
ElementAt

Description
Accesses an element of a vector

Arguments
input column index (int)

Output
element of vector

This can be used as part of, or the whole of, an expression. For example, an expression to add 1 to the third element of a vector input column 'InLink.col1' would be:
ElementAt(InLink.col1, 2) + 1

Type Conversion Functions


The following table lists the functions available in the Type Conversion category (square brackets indicate an argument is optional):
Name
DateToString

Description
Return the string representation of the given date Returns the given decimal in decimal representation with specified precision and scale Returns the given decimal in dfloat representation Return the string representation of the given decimal Returns the given dfloat in decimal representation Returns the given dfloat in its string representation with no exponent, using the specified scale

Arguments
date [format (string)] decimal (decimal) [rtype (string)] [packedflag (int8)] number (decimal) [fix_zero] number (decimal) [fix_zero] number (dfloat) [rtype (string)] number (dfloat) scale (string)

Output
result (string)

DecimalToDecimal

result (decimal)

DecimalToDFloat DecimalToString DfloatToDecimal DfloatToStringNoExp

result (dfloat) result (string) result (decimal) result (string)


Name
IsValid

Description
Return whether the given string is valid for the given type. Valid types are "date", "decimal", "dfloat", "sfloat", "int8", "uint8", "int16", "uint16", "int32", "uint32", "int64", "uint64", "raw", "string", "time", "timestamp". ustring Returns a date from the given string in the given format Returns the given string in decimal representation Returns a string in raw representation Returns a time representation of the given string Returns a timestamp representation of the given string Returns a date from the given timestamp Return the string representation of the given timestamp Returns the time from a given timestamp Return the string representation of the given time

Arguments
type (string) format (string)

Output
result (int8)

StringToDate

date (string) format (string) string (string) [rtype (string)] string (string) string (string) [format (string)] string (string) [format (string)] timestamp timestamp [format (string)] timestamp time [format (string)]

date

StringToDecimal StringToRaw StringToTime StringToTimestamp

result (decimal) result (raw) time timestamp

TimestampToDate TimestampToString TimestampToTime TimeToString

date result (string) time result (string)

StringToUstring

Returns a ustring from the given string, optionally using the specified map (otherwise uses project default) Returns a string from the given ustring, optionally using the specified map (otherwise uses project default)

string (string) [,mapname (string)]

result (ustring)

UstringToString

string(ustring) [,mapname (string)]

result (string)

Rtype. The rtype argument is a string, and should contain one of the following: ceil. Round the source field toward positive infinity. E.g., 1.4 -> 2, -1.6 -> -1.


floor. Round the source field toward negative infinity. E.g., 1.6 -> 1, -1.4 -> -2.

round_inf. Round or truncate the source field toward the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity. E.g., 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2.

trunc_zero. Discard any fractional digits to the right of the rightmost fractional digit supported in the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, round or truncate to the scale size of the destination decimal. E.g., 1.6 -> 1, -1.6 -> -1. The default is trunc_zero.

Format string. Date, Time, and Timestamp functions that take a format string (e.g., TimeToString(time, stringformat)) need specific formats.

For a date, the format components are:
   %dd             two digit day
   %mm             two digit month
   %yy             two digit year (from 1900)
   %year_cutoffyy  two digit year from year_cutoff (e.g. %2000yy)
   %yyyy           four digit year
   %ddd            three digit day of the year
The default format is %yyyy-%mm-%dd.

For a time, the format components are:
   %hh     two digit hour
   %nn     two digit minutes
   %ss     two digit seconds
   %ss.x   two digit seconds and microseconds to x decimal places
The default is %hh:%nn:%ss.

A timestamp can include the components for date and time above. The default format is %yyyy-%mm-%dd %hh:%nn:%ss.

Where your dates, times, or timestamps convert to or from ustrings, DataStage will pick this up automatically. In these cases the separators in your format string (for example, : or -) can themselves be Unicode characters.

fix_zero. By default decimal numbers comprising all zeros are treated as invalid. If the string fix_zero is specified as a second argument, then all zero decimal values are regarded as valid.
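For illustration only (the column names are invented), conversions using an rtype or a format string look like this:

   StringToDate(InLink.order_dt, "%dd/%mm/%yyyy")
   DateToString(InLink.ship_date, "%yyyy-%mm-%dd")
   DfloatToDecimal(InLink.amount, "ceil")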


Type Casting Functions


There is a special class of type conversion function to help you when performing mathematical calculations using numeric fields. For example, if you have a calculation using an output column of type float derived from an input column of type integer in a Parallel Transformer stage the result will be derived as an integer regardless of its float type. If you want a non-integral result for a calculation using integral operands, you can use the following functions (which act in a similar way as casting in C) to cast the integer operands into non-integral operands:
Name
AsDouble AsFloat

Description
Treat the given number as a double Treat the given number as a float Treat the given number as an integer

Arguments
number (number) number (number)

Output
number (double) number (float)

AsInteger

number (number)

number (int)
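For example, to obtain a non-integral percentage from two integer input columns (the column names are invented):

   AsFloat(InLink.passed_count) / AsFloat(InLink.total_count) * 100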

Utility Functions
The following table lists the functions available in the Utility category (square brackets indicate an argument is optional):
Name Description Arguments
environment variable (string)

Output
result (string)

GetEnvironment Return the value of the given environment variable
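For example, the following returns the value of the APT_CONFIG_FILE environment variable, assuming it is set in the job's environment:

   GetEnvironment("APT_CONFIG_FILE")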


C
Fillers
This appendix describes how fillers are created when you load columns from COBOL file definitions that represent simple or complex flat files. Since these file definitions can contain hundreds of columns, you can choose to collapse sequences of unselected columns into FILLER items. This maintains the byte order of the columns, and saves storage space and processing time. This appendix gives examples of filler creation for different types of COBOL structures, and shows how you can expand fillers later if you need to reselect any columns.

Creating Fillers
Unlike other parallel stages, Complex Flat File stages have stage columns. You load columns on the Columns tab of the Complex Flat File Stage page. After you select a table from the Table Definitions dialog box, the Select Columns From Table dialog box appears.


This dialog box has an Available columns tree that displays COBOL structures such as groups and arrays, and a Selected columns list that displays the columns to be loaded into the stage. The Create fillers checkbox is selected by default. When columns appear on the Columns tab, FILLER items are shown for the unselected columns. FILLER columns have a native data type of CHARACTER and a name of FILLER_XX_YY, where XX is the start offset and YY is the end offset. Fillers for elements of a group array or an OCCURS DEPENDING ON (ODO) column have the name of FILLER_NN, where NN is the element number. The NN begins at 1 for the first unselected group element and continues sequentially. Any fillers that follow an ODO column are also numbered sequentially. Level numbers of column definitions, including filler columns, are changed after they are loaded into the Complex Flat File stage. However, the underlying structure is preserved.

Filler Creation Rules


The rules for filler creation are designed to preserve the storage length of all columns being replaced by a filler column. They allow a filler column to be expanded back to the original set of defining columns, each having the correct name, data type, and storage length.


The basic filler creation rules are:


1  A filler column will replace one or more original columns with a storage length equal to the sum of the storage length of each individual column being replaced.

2  Separate fillers are created when column level numbers decrease. For example, if an unselected column at level 05 follows an unselected column at level 10, separate fillers are created for the columns at the 05 and 10 levels.

3  Any ODO column and its associated depending on column will be automatically selected and not replaced by a filler column.

4  If a REDEFINE column is selected, the column that it is redefining is also automatically selected and will not be included as part of a filler column.

5  If two fillers share the same storage offset (such as for a REDEFINE), the name of the subsequent fillers will be FILLER_XX_YY_NN, where NN is a sequential number that begins at 1.

6  If the starting or ending column for a filler is the child of a parent that contains an OCCURS clause, then the generated FILLER name will be FILLER_NN instead of FILLER_XX_YY.

The remaining rules are explained through the following set of examples.

Filler Creation Examples


The following examples explain filler creation for groups, REDEFINES, and arrays using different scenarios for column selection. The source table contains single columns, a group that redefines a column, a nested group with an array, and a column that redefines another column, as shown:
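The original illustration of this source table is a screenshot and is not reproduced here. The sketch below is a reconstruction for orientation only; the level numbers, native types, and lengths are assumptions chosen to be consistent with the filler names and lengths quoted in the examples that follow (for instance, FILLER_2_11 with a length of 10 covering columns B, E, and G):

   02  A          CHARACTER(1)
   02  B          CHARACTER(8)
   02  GRP1       GROUP, REDEFINES B
   05    C1       CHARACTER(1)
   05    C2       CHARACTER(1)
   05    GRP2     GROUP, OCCURS 2 TIMES
   10      D1     CHARACTER(1)
   10      D2     CHARACTER(1)
   10      D3     CHARACTER(1)
   02  E          CHARACTER(1)
   02  F          CHARACTER(1), REDEFINES E
   02  G          CHARACTER(1)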


Select a Simple Column


Suppose a simple column (A) is selected from the table, as shown:

On the Columns tab, a single filler is created with the name FILLER_2_11 and a length of 10. The length represents the sum of the lengths of columns B (8), E (1), and G (1). GRP1 and its elements, along with column F, are excluded because they redefine other columns that were not selected:


Select a Column Redefined by a Group


Now suppose column B is selected, which is redefined by GRP1:

Two fillers are created, one for column A and the other for columns E and G. Since GRP1 redefines column B, it is dropped along with its elements. Column F is also dropped because it redefines column E. The dropped columns will be available during filler expansion (see "Expanding Fillers" on page C-16):


Select a Group Column that Redefines a Column


Next, suppose a group column is selected (GRP1) that redefines another column (B):

This time three fillers are created: one for column A, one for columns C1 through D3 (which are part of GRP1), and one for columns E and G. Column B is preserved since it is redefined by the selected group column:


Select a Group Element


This example shows what happens when a group element (C1) is selected by itself:

Three fillers are created: one for column A, one for columns C2 through D3, and one for columns E and G. Since an element of GRP1 is selected and GRP1 redefines column B, both column B and GRP1 are preserved:


The next example shows what happens when a different group element is selected, in this case, column C2:

Four fillers are created: one for column A, one for column C1, one for GRP2 and its elements, and one for columns E and G. Column B and GRP1 are preserved for the same reasons as before:


Select a Group Array Column


Consider what happens when a group array column (GRP2) is selected and passed as is:

Four fillers are created: one for column A, one for columns C1 and C2, one for columns D1 through D3, and one for columns E and G. Since GRP2 is nested within GRP1, and GRP1 redefines column B, both column B and GRP1 are preserved:


If the selected array column (GRP2) is flattened, both occurrences (GRP2 and GRP2_2) appear on the Columns tab:

Select an Array Element


Suppose an element (D1) of a group array is selected:

If the array GRP2 is passed as is, fillers are created for column A, columns C1 and C2, columns D2 and D3, and columns E and G.


Column B, GRP1, and GRP2 are preserved for the same reasons as before:

Select a Column Redefined by Another Column


Let's see what happens when column E is selected, which is redefined by another column:


Two fillers are created: one for columns A through D3, and another for column G. Since column F redefines column E, it is dropped, though it will be available for expansion (see "Expanding Fillers" on page C-16):

Now suppose the REDEFINE column (F) is selected:


In this case column E is preserved since it is redefined by column F. One filler is created for columns A through D3, and another for column G:

Select Multiple Redefine Columns


This example describes how fillers are created for multiple redefine columns. In this case the same column is being redefined multiple times. The source table contains a column and a group that redefine column A, as well as two columns that redefine the group that redefines column A:

If columns C2 and E are selected, four fillers are created: one for column B, one for column C1, one for column D, and one for column F. Since an element of GRP1 is selected and GRP1 redefines column A, both column A and GRP1 are preserved. The first three fillers have the same start offset because they redefine the same storage area, as shown:

Select Multiple Cascading Redefine Columns


This example shows filler creation for multiple redefine columns, except this time they are cascading redefines instead of redefines of the same column. Consider the following source table, where column B redefines column A, GRP1 redefines column B, column D redefines GRP1, and column E redefines column D:

If columns C2 and E are selected, this time only two fillers are created: one for column C1 and one for column F. Column A, column B, GRP1, and column D are preserved because they are redefined by other columns, as shown:

Select an OCCURS DEPENDING ON Column


The final example shows how an ODO column is handled. Suppose the source table has the following structure:


If column B is selected, four fillers are created as shown:

Fillers are created for column A, column C1, columns D1 through D3, and columns E and G. GRP1 is preserved because it redefines column B. Since GRP2 (an ODO column) depends on column C2, column C2 is preserved. GRP2 is preserved because it is an ODO column.

Expanding Fillers
After you select columns to load into a Complex Flat File stage, the selected columns and fillers appear on the Columns tab of the Stage page. If you need to reselect any columns represented by fillers, it is not necessary to reload your table definition. An Expand Filler... option allows you to reselect any or all of the columns from a given filler.


To expand a filler, right-click the filler column in the columns tree and select Expand Filler... from the shortcut menu. The Expand Filler dialog box appears:

The contents of the given filler are displayed in the Available columns tree, allowing you to reselect those columns you need. In this example, you expanded FILLER_2_9. Suppose you select column C1 in the Expand Filler dialog box. The Columns tab now appears similar to this:

If you expand FILLER_3_9 and select column C2, the Columns tab now appears similar to this:


If you continue to expand the fillers, eventually the Columns tab will contain all of the original columns in the table, as shown:


Index
A
Abs B5 Acos B5 Advanced tab Inputs page 345 Advanced tab Ouputs page 355 advanced tab, stage page 312 after-stage subroutines for Transformer stages 176, 1716 aggragator stage 181 aggragator stage properties 186 AlNum B8 Alpha B8 AsDouble B14 AsFloat B14 Asin B5 AsInteger B14 Atan B5 automatic type conversions 286 cluster systems 11 collecting data 27 collection types ordered 220 round robin 219 sorted merge 221 column auto-match facility 1611, 1711, 2117 column export stage 421 column export stage properties properties column export stage 425 column generator stage 281, 541 column generator stage properties 285, 546 column import stage 411 column import stage properties properties column import stage 374, 407, 416 columns tab, inputs page 326 columns tab, outputs page 351 combine records stage 451 combine records stage properties 458 CompactWhiteSpace B8 Compare B8 compare stage 341, 344 compare stage properties 344 CompareNoCase B8 CompareNum B8 CompareNumNoCase B8 complex data types 232 complex flat file input properties 1018 complex flat file output properties 1021 compress stage 251 compress stage properties 252 configuration file 26 configuration file editor 581 containers 233 Convert B9

B
before-stage subroutines for Transformer stages 176, 1716 BitAnd B4 BitCompress B4 BitExpand B4 BitOr B4 BitXOr B4

C
Ceil B5 change apply stage 321 change apply stage properties 325 change capture stage 311 change capture stage properties properties change capture stage 296, 314 Cluster systems 25


copy stage 271, 291 copy stage properties 276 Cos B5 Cosh B5 Count B9

D
data set 226 data set stage 41 data set stage input properties 44 data set stage output properties 47 data types 228 complex 232 DateFromDaysSince B1 DateFromJulianDay B1 DateToString B11 DaysSinceFromDate B1 DB2 partition properties 324 DB2 partitioning 218 DB2 stage 121 DB2 stage input properties 1219 DB2 stage output properties 1239 Dcount B9 DecimalToDecimal B11 DecimalToDFloat B11 DecimalToString B11 decode stage 361 decode stage properties 362 defining local stage variables 1618, 1720 DfloatToDecimal B11 DfloatToStringNoExp B11 difference stage 331 difference stage properties 334 Div B5 documentation conventions iiiiv DownCase B9 DQuote B9

Expression Editor 1722 expression editor 1621, 2129 external fileter stage 301 external fileter stage properties 302 external source stage 81 external source stage output properties 85 external target stage 91 external target stage input properties 94

F
Fabs B5 Field B9 file set output properties 623 file set stage 61 file set stage input properties 65 filler creation and expansion 1014 Find and Replace dialog box 168, 178, 2114 Floor B5 Folder stages 121 format tab, inputs page 325 functions B1 funnel stage 221 funnel stage properties 228

G
general tab, inputs page 319 general tab, outputs page 348 general tab, stage page 38 Generic stage 391 generic stage properties 392 GetEnvironment B14

H
hash by field partitioning 212 head stage 491 head stage properties 495 HoursFromTime B1

I E
editing Transformer stages 167, 177, 2113 encode stage 351, 352 encode stage properties 352 entire partitioning 211 examples Development Kit program A1 Exp B5 expand stage 261 expand stage properties 262 Index B9 index organized tables (Oracle) 136, 1323 Informix XPS stage 151 Informix XPS stage input properties 1510 Informix XPS stage output properties properties Informix XPS stage output 1516 input links 165, 175 inputs page 314, 318 columns tab 326 format tab 325

Index-2

Book Title

Index

general tab 319 partitioning tab 320 properties tab 319 IsNotNull B7 IsNull B7 IsValid B12

MPP ssystems 25 MPP systems 11

N
Neg B6 NextWeekdayFromDate B2 Not B4 Nulls handling in Transformer stage input columns 1616 NullToEmpty B7 NullToValue B7 NullToZero B7 Num B9

J
join stage 191 join stage properties 197 JulianDayFromDate B2

L
Ldexp B5 Left B9 Len B9 level number 328 link ordering tab, stage page 314 links input 165, 175 output 165, 175 reject 165, 175 specifying order 1618, 1719, 2120 Llabs B5 Ln B6 Log10 B6 lookup file set stage 71 lookup file set stage output properties 79 lookup stage 211

O
optimizing performance 21 Oracle stage 131 Oracle stage input properties 1317 Oracle stage output properties 1328 ordered collection 220 output links 165, 175 outputs page 347 columns tab 351 general tab 348 mapping tab 352 properties tab 348

P
PadString B9 parallel engine configuration file 581 parallel processing 21 parallel processing environments 25 partial schemas 228 partition parallel processing 21 partition paralllelism 23 partitioning data 27 partitioning icons 224 partitioning tab, inputs page 320 partitioning types DB2 218 entire 211 hash by field 212 modulus 214 random 29 range 216 round robin 28 same 210 Peek stage 391, 521 peek stage 391, 521

M
make subrecord stage 431 make subrecord stage properties 436 make vector stage 471, 477 make vector stage properties 477 MakeNull B7 MantissaFromDecimal 289, B7 MantissaFromDFloat 289, B8 mapping tab, outputs page 352 Max B6 merge stage 201 merge stage properties 205 meta data 226 MicroSecondsFromTime B2 Min B6 MinutesFromTime B2 Mod B6 modulus partitioning 214 MonthDayFromDate B2 MonthFromDate B2

Book Title

Index-3

Index

peek stage properties 522 pipeline processing 21, 22 PreviousWeekdayFromDate B2 promote subrecord stage 461 promote subrecord stage properties 467 properties 245, 352 aggragator stage 186 change apply stage 325 combine records stage 458 compare stage 344 complex flat file input 1018 complex flat file output 1021 compress stage 252 copy stage 276 data set stage input 44 data set stage output 47 DB2 stage input 1219 DB2 stage output 1239 decode stage 362 difference stage 334 expand stage 262 external fileter stage 302 external source stage output 85 external stage input 94 file set input 65 file set output 623 funnel stage 228 generic stage 392 head stage 495 Informix XPS stage input 1510 join stage 197 lookup file set stage output 79 make subrecord stage 436 make vector stage 477 merge stage 205 Oracle stage input 1317 Oracle stage output 1328 peek stage 522 promote subrecord stage 467 sample stage 518 SAS data set stage input 114 SASA data set stage output 117 sequential file input 59 sequential file output 525 sort stage 2310 split subrecord stage 446 split vector stage 486 tail stage 504 Teradata stage input 1410 Teradata stage output 1417 trnsformer stage 1622, 2130

write range map stage input properties 556 properties tab, inputs page 319 properties tab, outputs page 348 properties tab, stage page 38 propertiesrow generator stage output 539 prperties column generator stage 285, 546 Pwr B6

R
Rand B6 Random B6 random partitioning 29 range partition properties 324 range partitioning 216 RawLength B8, B11 reject links 165, 175 remove duplicates stage 241, 245 remove duplicates stage properties 245 repartioning data 27 restructure operators splitvect 481 Right B9 round robin collection 219 round robin partitioning 28 row generator stage 531 row generator stage output properties 539 runtime column propagation 227, 351, 542, 4124, 4224

S
same partitioning 210 sample stage 511 sample stage properties 518 SAS data set stage 111 SAS data set stage input input properties 114 SAS data set stage output properties 117 SAS stage 381 schema files 228 SecondsFromTime B2 SecondsSinceFromTimestamp B2 sequential file input properties 59 sequential file output properties 525 sequential file stage 51 SetBit B4 SetNull B7 shared containers 233 shortcut menus in Transformer Editor 164, 174, 2112 Sin B6

Index-4

Book Title

Index

Sinh B6 SMP systems 11, 25 sort stage 231 sort stage properties 2310 sorted merge collection 221 soundex B9 Space B9 split subrecord stage 441 split subrecord stage properties 446 split vector stage 481 split vector stage properties 486 splitvect restructure operator 481 Sqrt B6 Squote B9 stage editors 31 stage page 38 advanced tab 312 general tab 38 link ordering tab 314 properties tab 38 stage validation errors 37 stages editing Sequential File 71 sequential file 51 Str B9 StringToDate B12 StringToDecimal B12 StringToRaw B12 StringToTime B12 StringToTimestamp B12 StringToUstring B12 StripWhiteSpace B10 subrecords 232 surrogate key stage 401 switch stage 371

TimestampFromTimet B2 TimestampToDate B12 TimestampToString B12 TimestampToTime B12 TimetFromimestamp B2 TimeToString B12 toolbars Transformer Editor 163, 173, 2111 Transformer Editor 172 link area 163, 173, 2111 meta data area 164, 174, 2112 shortcut menus 164, 174, 2112 toolbar 163, 173, 2111 transformer stage 161 transformer stage properties 1622, 2130 Transformer stages basic concepts 165, 175, 2113 editing 167, 177, 2113 Expression Editor 1722 specifying after-stage subroutines 1716 specifying before-stage subroutines 1716 Trim B10 TrimB B10 TrimF B10 TrimLeadingTrailing B10 type conversion functions 287 type conversions 286

U
UniVerse stages 51 Upcase B10 USS systems 11, 561 UstringToString B12

V
vector 233 visual cues 37

T
table definitions 227 tagged subrecords 232 tail stage 501 tail stage properties 504 Tan B6 Tanh B6 Teradata stage 141 Teradata stage input properties 1410 Teradata stage output properties 1417 TimeDate B2 TimeFromMidnightSeconds B2 TimestampFromDateime B2 TimestampFromSecondsSince B2

W
WeekdayFromDate B3 write range map stage 551 write range map stage input properties 556

Y
YeardayFromDate B3 YearFromDate B3 YearweekFromDate B3

Z
z/OS systems 561

Book Title

Index-5

Index

Index-6

Book Title
