DT 861 StudioUserGuide
Informatica Data Transformation Studio User Guide
Version 8.6.1
November 2008

Copyright (c) 2001-2008 Informatica Corporation. All rights reserved.

This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software is protected by U.S. Patent Numbers and other Patents Pending.

Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

The information in this software and documentation is subject to change without notice. Informatica Corporation does not warrant that this software or documentation is error free.

Informatica, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica Complex Data Exchange, Informatica On Demand Data Replicator, and Informatica B2B Data Exchange are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright Sun Microsystems. All rights reserved. Copyright 1985-2003 Adobe Systems Inc. All rights reserved. Copyright 1996-2004 Glyph & Cog, LLC. All rights reserved.

This product includes software developed by Boost (http://www.boost.org/). Permissions and limitations regarding this software are subject to terms available at http://www.boost.org/LICENSE_1_0.txt.

This product includes software developed by Mozilla (http://www.mozilla.org/). Your right to use such materials is set forth in the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The Mozilla materials are provided free of charge by Informatica, as-is, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

This product includes software developed by the Apache Software Foundation (http://www.apache.org/) which is licensed under the Apache License, Version 2.0 (the License). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This product includes software developed by SourceForge (http://sourceforge.net/projects/mpxj/). Your right to use such materials is set forth in the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The SourceForge materials are provided free of charge by Informatica, as-is, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

This product includes Curl software which is Copyright 1996-2007, Daniel Stenberg, <daniel@haxx.se>. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://curl.haxx.se/docs/copyright.html. Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

This product includes ICU software which is Copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://www-306.ibm.com/software/globalization/icu/license.jsp.

This product includes OSSP UUID software which is Copyright (c) 2002 Ralf S. Engelschall, Copyright (c) 2002 The OSSP Project Copyright (c) 2002 Cable & Wireless Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mit-license.php.

This product includes Eclipse software which is Copyright (c) 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.eclipse.org/org/documents/epl-v10.php.

libstdc++ is distributed with this product subject to the terms related to the code set forth at http://gcc.gnu.org/onlinedocs/libstdc++/17_intro/license.html.

DISCLAIMER: Informatica Corporation provides this documentation as is without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of non-infringement, merchantability, or use for a particular purpose. The information provided in this documentation may include technical inaccuracies or typographical errors. Informatica could make improvements and/or changes in the products described in this documentation at any time without notice.
Table of Contents

Preface
  Informatica Resources
    Informatica Customer Portal
    Informatica Documentation
    Informatica Web Site
    Informatica Knowledge Base
    Informatica Global Customer Support

Chapter 3: Parsers
  Creating a Parser
    Using the New Parser Wizard to Create a Parser
    Creating a Parser by Editing the IntelliScript
  Running a Parser
  Platform-Independent Parsers
  Parser Component Reference
    Parser

Chapter 4: Ports
  Overview
  Port Quick Reference
  Port Component Reference
    AdditionalInputPort

Chapter 6: Formats
  Defining Document Formats
  Standard Properties of Formats
  Format Component Reference
    BinaryFormat
    CustomFormat
    HtmlFormat
    RtfFormat
    TextFormat
    XmlFormat
  Delimiters Component Reference
    CommaDelimited
    DelimiterHierarchy
    HL7
    Positional
    PostScript
    RTF
    SGML
    SpaceDelimited
    TabDelimited
  Delimiter Subcomponent Reference
    Delimiter
    EnclosingDelimiters
  Format Preprocessor Component Reference
    HtmlProcessor
    RtfProcessor

    Mapping Mixed Content
    Mapping XSI Types
  Generating Valid XML
    Role of XSD in Parsing
    Role of XSD in Serialization and Mapping
  Variables
    User-Defined Variables
    System Variables
    Mapping Anchors to Variables
    Using Variables in Actions
    Initializing Variables at Runtime
  Variable Component Reference
    Variable
  Multiple-Occurrence Data Holders
    Attributes
    Indexing
    Destroying the Occurrences

Chapter 8: Anchors
  Overview
    Marker and Content Anchors
    Other Anchor Types
    How Anchors and Delimiters Work Together
  Mapping Content Anchors to Data Holders
    Mapping to Variables
    Mapping to Multiple-Occurrence Data Holders
    Mapping to Mixed-Content Elements
  Defining Anchors
    Where to Define Anchors
    Sequence of Anchors
    Select-and-Click Approach for Marker and Content Anchors
    Drag-and-Drop Approach for Content Anchors
    Using the IntelliScript to Define Anchors
  Standard Anchor Properties
  How a Parser Searches for Anchors
    Search Phases
    Search Scope and Search Criteria
    Adjusting the Search Phase
    Adjusting the Search Scope
    Adjusting the Search Criteria
    Using XSD Data Types to Narrow the Search Criteria
    Anchors that Contain Nested Anchors
  Anchor Quick Reference
  Anchor Component Reference
    Alternatives
    Content
    DelimitedSections
    EmbeddedParser
    EnclosedGroup
    FindReplaceAnchor
    Group
    HtmlForm
    Marker
    RepeatingGroup
  Searcher Component Reference
    AttributeSearch
    LearnByExample
    NewlineSearch
    OffsetSearch
    PatternSearch
    SegmentSearch
    TextSearch
    TypeSearch
  Anchor Subcomponent Reference
    AddField
    Connect
    ImageClick
    ModifyField
    RemoveField
    SegmentIndex
    SegmentSize
    SubmitAll
    SubmitClick

    BidiConvert
    BigEndianUniToUni
    CDATADecode
    CDATAEncode
    ChangeCase
    CreateGuid
    CreateUUID
    DateFormatICU
    Dos96HebToAscii
    EbcdicToAscii
    EncodeAsUrl
    Encoder
    ExternalTransformer
    FormatNumber
    FromFloat
    FromInteger
    FromPackDecimal
    FromSignedDecimal
    hebrewBidi
    HebrewDosToWindowsTransformer
    HebrewEBCDICOldCodeToWindows
    hebUniToAscii
    hebUtf8ToAscii
    HtmlEntitiesToASCII
    HtmlProcessor
    InjectFP
    InjectString
    JavaTransformer
    LookupTransformer
    NormalizeClosingTags
    ODBCLookup
    RegularExpression
    RemoveMarginSpace
    RemoveRtfFormatting
    RemoveTags
    Replace
    Resize
    ReverseTransformer
    RtfProcessor
    RtfToASCII
    SubString
    ToFloat
    ToInteger
    ToPackDecimal
    ToSignedDecimal
    TransformationStartTime
    TransformByParser
    TransformByProcessor
    TransformByService
    TransformerPipeline
    WestEuroUniToAscii
    XSLTTransformer
  Transformer Subcomponent Reference
    InlineTable
    ODBC_Text_Connection
    XMLLookupTable

    XSLTMap
  Action Subcomponent Reference
    COMClass
    MSMQOutput
    ODBC_XML_Connection
    OpenURL
    OutputCOM
    OutputDataHolder
    OutputFile
    ResultFile

  Mapper Component Reference
    Mapper
  Mapper Anchor Component Reference
    AlternativeMappings
    EmbeddedMapper
    GroupMapping
    RepeatingGroupMapping

    Removing a Deployed Service
  Deploying a Service to a Production Server
  Running a Service

Index
Preface
The Data Transformation Studio User Guide is written for developers, analysts, and other users who are responsible for designing and implementing transformations. The book explains how to design, configure, test, and deploy transformations by using Data Transformation Studio. It contains detailed reference sections documenting the transformation components and their properties. Before reading this book, you need a basic knowledge of how to use Data Transformation. You can obtain that knowledge by performing the hands-on lessons in Getting Started with Data Transformation. In parallel with this book, you can refer to Using Data Transformation in Eclipse for instructions on using the Studio menus, toolbars, views, and editors.
Informatica Resources
Informatica Customer Portal
As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica Knowledge Base, Informatica Documentation Center, and access to the Informatica user community.
Informatica Documentation
The Informatica Documentation team makes every effort to create accurate, usable documentation. If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation team through email at infa_documentation@informatica.com. We will use your feedback to improve our documentation. Let us know if we can contact you regarding your comments.
support@informatica.com for technical inquiries
support_admin@informatica.com for general customer service requests
WebSupport requires a user name and password. You can request a user name and password at http://my.informatica.com.
Use the following telephone numbers to contact Informatica Global Customer Support:
North America / South America
Informatica Corporation Headquarters
100 Cardinal Way
Redwood City, California 94063
United States
Standard Rate
Brazil: +55 11 3523 7761
Mexico: +52 55 1168 9763
United States: +1 650 385 5800

Europe / Middle East / Africa
Informatica Software Ltd.
6 Waltham Park
Waltham Road, White Waltham
Maidenhead, Berkshire SL6 3TN
United Kingdom
Standard Rate
Belgium: +32 15 281 702
France: +33 1 41 38 92 26
Germany: +49 1805 702 702
Netherlands: +31 306 022 797
United Kingdom: +44 1628 511 445

Asia / Australia
Informatica Business Solutions Pvt. Ltd.
Diamond District
Tower B, 3rd Floor
150 Airport Road
Bangalore 560 008
India
Toll Free
Australia: 1 800 151 830
Singapore: 001 800 4632 4357
Standard Rate
India: +91 80 4112 5738
CHAPTER 1
Designing Transformations
This chapter includes the following topics:
- Overview, 1
- Transformation Architecture, 1
- Project Architecture, 3
- Workflow for Designing Transformations, 4
- Online Samples, 6
Overview
Data Transformation Studio is the design and configuration environment of the Informatica Data Transformation system. Using Data Transformation Studio, you can design and implement transformations that operate on any kind of data. This book is a learning and reference manual for designing transformations. The book contains:
- Explanations of the Data Transformation concepts
- Details on how to use all the Data Transformation components, such as parsers, serializers, transformers, mappers, anchors, and actions
- Examples and tips on how to design transformations that work with many different kinds of input and output
- Instructions for deploying transformations that you have designed in Data Transformation Studio to the Data Transformation Engine runtime environment
Transformation Architecture
When you construct a transformation, you build it in modular fashion from components of the Data Transformation system. The components are arranged in a hierarchy or tree, which you can view in the IntelliScript editor of Data Transformation Studio. The components work with input and output documents and with the data holders that store Data Transformation data. This section provides a brief overview of the components and terminology that are used in this architecture. For detailed information about each component type, see the following chapters of this book.
Transformation Components
Top-Level Components
At the top level of the hierarchy, a transformation can run a parser, serializer, mapper, transformer, or streamer. These components are defined as follows:
- Parser: A component that converts source documents in any format to XML.
- Serializer: A component that converts XML documents to output documents in any format.
- Mapper: A component that converts XML documents to a different XML structure or schema.
- Transformer: A component that modifies data. The input and output can be in any format.
- Streamer: A component that splits large inputs into segments that are processed separately by the other components.
Of these component types, parsers, serializers, and mappers are the most powerful and generally useful. By running a parser and serializer in sequence, for example, you can convert any format to any other format. Using these components, you can perform conversions of unlimited complexity. As top-level components, transformers are useful for relatively simple data conversions, such as replacing predefined strings. Usually, the input and output documents have the same, non-XML format. Because of this limitation, transformers are more often used as nested components, and not as top-level components. For more information, see Nested Components on page 2. Streamers are special-purpose components for processing large inputs such as gigabyte data streams. They do not perform transformations on their own. Instead, they activate other components such as parsers to perform the transformations.
Note: There is a distinction between transformation, which is the generic term for the operations that Data Transformation performs on data, and transformer, which is a specific type of Data Transformation component.
Nested Components
Within a parser, serializer, or mapper component, you can nest components such as:
- Formats: Define the overall format of documents, such as the delimiters, that Data Transformation should use to interpret the documents.
- Document processors: Operate on a document as a whole, performing preliminary or final conversions.
- Anchors: Define the data in a source document that a parser should process and extract. The anchors specify how a parser should search for the data and where it should store the data that it finds.
- Serialization anchors: Define how a serializer should write XML data to an output document. Serialization anchors are the inverse of anchors. An anchor writes data from a source document to XML, whereas a serialization anchor writes data from XML to an output document.
- Mapper anchors: Define how a mapper should write XML data to another XML structure or schema. The anchors specify where to find the data in the source XML and where to write the data in the output XML.
- Actions: Perform operations on data in the scope of a transformation, for example, concatenating strings that a parser has extracted from a source document, summing numbers that a serializer finds in an XML input document, or querying a database for additional data.
- Transformers: In addition to their use as top-level components, you can nest transformers within a parser or a serializer. For example, within a parser, you can nest a transformer that modifies the output of the anchors. As a nested component, a transformer operates on a portion of a document, not on the complete document.
Indirectly, you can also nest parsers and serializers within each other. For example, within a parser, you can nest an action that runs another parser on a portion of the same document or on a second document.
Subcomponents
In addition to the main components that are described above, Data Transformation has a large number of subcomponents that are used for special purposes within the main components. It is also possible to develop custom components, such as custom document processors or custom transformers, to serve special needs.
Data Holders
Data holders are the XML elements, XML attributes, and variables that transformations use for data storage. The elements and attributes are defined in XSD schemas, which are standard XML schema definitions. Data Transformation uses XSD to define data holders, to help it process XML input, and to help it construct valid XML output. The variables are defined in the Data Transformation configuration, using XSD data types. For more information about data holders and XSD schemas, see Data Holders on page 55.
Documents
The input and output of a transformation are called the source document and the output document. A document can have any size. It can contain any text or binary data. It can be stored or accessed in a file, buffer, stream, database, messaging system, or any other location.
For a parser, the source document can have any format. The output document of a parser has an XML format. The source document of a serializer is XML, and the output document can have any format. For a mapper, both the source and the output are XML. For a transformer, the source and output can have any format.
A special source document is called the example source. The example source is a document associated with a parser, serializer, or mapper, which you use to help configure and test the transformation.
XML is the common language connecting transformations together. For example, you can run a parser that converts a source document from any format to XML, and a serializer that converts the XML to any output format. By chaining the parser and serializer together, you can convert any input format to any output format. For more information about document structures, see Formats on page 43.
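The idea of chaining a parser and a serializer through XML can be sketched outside Data Transformation. The following Python example is only an illustration of the concept, not Data Transformation code: the function names, the tab-delimited input, and the Rows/Row/Name/Value element names are all assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET


def parse_tab_delimited(text: str) -> ET.Element:
    """Stand-in for a parser: tab-delimited text in, XML out."""
    root = ET.Element("Rows")  # hypothetical element names
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, value = line.split("\t")
        row = ET.SubElement(root, "Row")
        ET.SubElement(row, "Name").text = name
        ET.SubElement(row, "Value").text = value
    return root


def serialize_to_csv(root: ET.Element) -> str:
    """Stand-in for a serializer: XML in, comma-separated text out."""
    lines = [
        f"{row.findtext('Name')},{row.findtext('Value')}"
        for row in root.findall("Row")
    ]
    return "\n".join(lines)


# Chain the two steps: any input format -> XML -> any output format.
source = "alpha\t1\nbeta\t2\n"
xml_tree = parse_tab_delimited(source)
output = serialize_to_csv(xml_tree)
print(ET.tostring(xml_tree, encoding="unicode"))
print(output)
```

Because the intermediate result is XML, either half of the chain can be replaced independently, for example by a different serializer that writes another output format.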
Project Architecture
A transformation is stored in a project. Each project has a project folder that contains the project files.
Results folder
- Examine whether the document structure is amenable to parsing. Sometimes, a simple step such as converting the document to an alternative format, which you can do by applying a document processor, makes the document much easier to parse.
- Plan which data you need to extract from the source document and where you will insert the extracted data in the XML. In Data Transformation, you implement the data extraction by using the Content anchor.
- Analyze the structure of the source document and identify the features you can use to locate the data fields. In Data Transformation, these features translate to Marker anchors or to various other types of anchors.
- Find repetitive or structured features of the documents that might help extract the data. You can implement such features using anchors such as Group, EnclosedGroup, RepeatingGroup, or DelimitedSections.
- Decide whether you need to transform any of the data during or after the extraction process. The Data Transformation components that operate on the extracted data are transformers and actions.
- Determine whether there are any additional data sources, such as a linked document or a database, that you need to access to prepare the output. Access such data by using certain anchors, transformers, or actions.
For a serializer or a mapper, you can invert the steps. Plan the data that you need to extract from the source XML and where to insert it in the output document. It is often easier to design a serializer or a mapper than a parser because the input is fully structured XML.
1. Create a new Data Transformation project.
2. Add one or more XSD schemas to the project. The schemas must define the XML elements and attributes with which you will work.
3. Create the top-level transformation component such as a parser or serializer, and define its properties. For a parser, you might define an example source document and a format component. The format component can contain features such as a delimiters definition and transformers. For a serializer, you might define properties such as the type of output file. You can also create a serializer automatically, by inverting a parser that you have already created. The parser or serializer appears in the IntelliScript editor of Data Transformation Studio. The example source appears in the example pane of the IntelliScript editor.
4. Configure the subcomponents of the transformation. If the top-level component is a parser, the main components are anchors. You can define the anchors by graphical procedures or you can edit the IntelliScript. Data Transformation Studio helps you do this by highlighting and color-coding the anchors in the example source. For a serializer, the main components are serialization anchors, which you can create in the IntelliScript.
5. Use the Data Transformation Studio tools to test and execute the transformation. Correct any configuration errors that you detect during the testing.
Online Samples
As you use this book, you can view online samples that illustrate many Data Transformation features. The samples are Data Transformation projects, located in the samples subfolder of the main Data Transformation installation folder. The default location is:
c:\Program Files\Informatica\DataTransformation\samples
To view the samples, import the projects to Data Transformation Studio. For more information about importing projects, see Using Data Transformation Studio in Eclipse. In addition to the project samples, the samples folder contains sample program code for features such as custom processors and transformers.
CHAPTER 2
Overview, 7
Overview
Data Transformation Studio is the design and configuration environment of Data Transformation. You use it to develop and edit Data Transformation projects. If you have performed the exercises in Getting Started with Data Transformation, you already have experience using Data Transformation Studio. The exercises teach many aspects of the Studio operation.
Figure 2-1. Data Transformation Studio
Eclipse is a versatile platform, designed to support both Java development and plug-in development tools. Data Transformation Studio works seamlessly within Eclipse, allowing you to develop transformations easily. The Eclipse platform is supplied at no additional cost with the Data Transformation software. You do not need any previous experience with Eclipse to use Data Transformation Studio. For more information about using the Eclipse platform, see Using Data Transformation Studio in Eclipse.
CHAPTER 3
Parsers
This chapter includes the following topics:
Creating a Parser
Parsers are Data Transformation components that convert a source document to XML. The output of a parser is always XML. The input can have any format, such as text, HTML, Word, PDF, or HL7. The input can even be an XML document that the parser processes as string data. This chapter explains the procedures for creating and running a parser component. Further information, such as how to support specific document formats and how to define the anchors that process the text of a source document, is in the succeeding chapters. You can create a parser by either of the following methods:
- By using the New Parser wizard
- By editing the IntelliScript and inserting a Parser component
1. Click File > New > Project.
2. Under the Data Transformation category, select a Parser Project and click Next.
3. Follow the wizard prompts to enter the parser options.
When you finish, the Data Transformation Explorer view displays the new project containing the parser. The Component view displays the parser.
To create a new parser in an existing project:
1. Click File > New > Parser.
2. Follow the wizard prompts to enter the parser options. When you finish, the Data Transformation Explorer view displays a new TGP script file defining the parser. The Component view displays the parser.
The following table describes the wizard options. At each stage, the wizard suggests options that seem appropriate based on your previous entries. If the wizard does not suggest the precise options that you need, you can refine the configuration afterwards by editing the IntelliScript.
Table 3-1. Options in the New Parser Wizard
- Project name: An identifier for the project.
- Project contents: The storage location of the project folder. The default is the Studio workspace folder.
- Parser name: A name for the parser.
- Script name: A name for a TGP script file, where the wizard stores the parser definition.
- Schema file path: The name of an XSD schema that defines the XML structure of the parser output.
- Source type, source path: Define an example source document that you will use to configure and test the parser. The example source should illustrate the features that you expect the parser to process in production documents. Choose the example source carefully. After you configure a parser, it might be difficult to change the example source. You can select the following source types:
  - File: Browse to a file on the local computer or network.
  - Text: Type a text string that the parser will use as an example source.
  - None: Do not use an example source. You can add an example source later, or you can configure the parser without using an example source.
- Select the content type of the source documents, such as ASCII or Binary.
- If required, select a document processor that converts the source documents to a format that is amenable to parsing. The wizard suggests processors that seem appropriate for the content type. For example, if you select a Microsoft Word example source, the wizard suggests processors that convert Word documents to text, HTML, or XML formats.
- Format: Select the format of the source documents, for example, tab-delimited or HTML. The wizard suggests formats that seem appropriate for the content type. If you selected a document processor, the format is that of the processor output, rather than the original source format.
1. Display the parser in an IntelliScript editor. You can do this by double-clicking the parser in the Component view or by double-clicking the TGP script file in the Data Transformation Explorer.
Figure 3-1. IntelliScript Editor
2. Under the contains line, add a sequence of nested anchors and actions.
3. Run and test the parser and modify the IntelliScript as required. For more information, see Running a Parser on page 12.
1. At the top level of the IntelliScript, select the three dots (...) symbol. Press Enter and type a name for the parser.
2. To the right of the name, press Enter. Select a Parser component from the list.
3. Expand the tree under the Parser component. Assign its properties such as the example_source and the format.
4. If necessary, add an XSD schema defining the XML syntax of the parser output. For more information, see Data Holders on page 55.
5. Under the contains line, add a sequence of nested anchors and actions. For more information, see Anchors on page 71 and Actions on page 137.
6. Run and test the parser and modify the IntelliScript as required. For more information, see Running a Parser on page 12.
Running a Parser
To run a parser in Data Transformation Studio, follow this procedure. For more information, see Running and Testing Projects on page 225.
To run a parser:
1. In the IntelliScript editor or in the Component view, right-click the parser and click Set as Startup Component. Alternatively, click Run > Run and set the startup component in the dialog box.
2. Click Run > Run MyParser, where MyParser is the name of the parser that you have set as the startup component.
3. After a few seconds, the Studio displays the Events view. Examine it for any failures or warnings.
4. To display the parsing results, double-click the file Results\output.xml in the Data Transformation Explorer view.
Platform-Independent Parsers
Data Transformation runs on both Microsoft Windows and UNIX-type systems. Most parser features run equally well on both platforms. There are a few exceptions to this rule. If you plan to run a parser on both Windows and UNIX, here are a few tips that can help ensure platform independence.
Document Processors
Use document processors that do not have platform-specific system requirements. For more information, see Document Processors on page 23. For example, use the ExcelToXml processor instead of ExcelToHTML. The former is platform-independent. The latter requires that Microsoft Excel be installed on the computer.
Custom Components
Data Transformation supports custom document processors, transformers, and actions. Use platform-independent versions of the custom components, such as:
ExternalJavaPreProcessor, ExternalPreProcessor
programmed in Java
and UNIX
Do not use ExternalCOMPreProcessor and ExternalCOMAction, which are supported only on Windows. For more information about external components, see the Data Transformation Engine Developer Guide.
Newline Markers
Avoid defining Marker anchors that search for a carriage return character followed by a newline character (\r\n). This combination is commonly used in Windows but often not in UNIX. Instead, configure a Marker with the built-in NewlineSearch component, which searches for both the \r\n sequence and the \n or \r character alone.
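The principle behind a platform-independent newline search can be shown with a short Python sketch. It is an illustration only and does not reproduce how NewlineSearch is implemented: the regular expression and the sample strings are assumptions chosen for the example. The two-character sequence is listed first so that it is consumed as a single line break.

```python
import re

# Treat the two-character sequence and each single character as one line break.
# The two-character alternative comes first so it is not split into two breaks.
NEWLINE = re.compile(r"\r\n|\n|\r")


def split_lines(text: str) -> list[str]:
    """Split text on any of the common line-ending conventions."""
    return NEWLINE.split(text)


# Hypothetical sample documents that differ only in their line endings.
windows_text = "id\tname\r\n1\tRon\r\n"
unix_text = "id\tname\n1\tRon\n"
cr_only_text = "id\tname\r1\tRon\r"

print(split_lines(windows_text))
print(split_lines(unix_text))
print(split_lines(cr_only_text))
```

All three calls return the same list, which is the behavior you want from a parser that must run unchanged on Windows and UNIX source documents.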
File Paths
Use relative, as opposed to absolute, file paths. Remember that file paths on UNIX are case-sensitive.
Parser
A Parser is a component that converts a source document to XML. A Parser contains many nested components. Directly under the contains line of the Parser, you can nest anchors and actions. Under various Parser properties, you can assign components such as formats, delimiters, document processors, and transformers. For detailed information about all these components, see the following chapters of this book.
Example
The following is an example of a parser that processes tab-delimited text documents.
- example_source: An example source document that you use to configure the parser operation. The document should be representative of the source documents that the parser will process. The value of the property is an input port such as LocalFile or Text. For more information, see Ports on page 15. Nested within this property, you can assign a preprocessor that converts the source documents to a format that the parser can accept. For more information, see Document Processors on page 23. To view the example source, right-click the parser and click Open Example Source.
- format: Specifies the format of the source document, such as:
  - Whether the document contains text, HTML, or binary code
  - The delimiters that separate data fields in the document
  - Transformers that the parser should apply by default to all Content anchors
  For more information, see Formats on page 43.
- reject_recurring_pages: If selected, the parser does not parse the same page twice in the same execution. This is useful, for example, if a parser is following the links on a web site, and you want to prevent it from parsing duplicate links to the same page. The ResetVisitedPages action resets the history list and allows a parser to process a page again, even if reject_recurring_pages is selected.
- no_initial_phase: If selected, the parser runs without an initial phase. Components that are configured to run in the initial phase run in the main phase, instead.
- sources_to_extract: A specification of the source documents that the parser should process. The value of the property is an input port. For more information, see Ports on page 15. If you assign sources_to_extract and you run the parser in Data Transformation Studio, the parser processes the specified documents. If you leave sources_to_extract blank, the parser processes the example_source.
- serialization_mode: This property specifies how the Studio should process portions of the example source that the parser does not output to XML, when you create a serializer from a parser. For more information, see Controlling How the Create Serializer Command Works on page 168. The possible values of the serialization_mode are:
  - Full. The Create Serializer command copies the non-XML text to the serializer configuration.
  - Outline. The Create Serializer command copies only the delimiters of the non-XML text to the serializer configuration. Under the Outline option, you can select the use_markers option. This causes the Create Serializer command to copy the content of the Marker anchors but only the delimiters of other non-XML text.
- A name that you assign to the parser. The name is displayed in the event log.
- A comment describing the parser.
- This property contains simulated values that another transformation might pass to the parser. The property is useful when designing a parser that is to be activated by another parser. Data Transformation uses the property only when it learns the example source. It ignores the property when it parses a source document. In the nested ExampleValue components, specify the data holders that the main parser passes to this parser and their simulated values.
- source, target: These properties are useful in situations where the parser must select specific occurrences of data holders. For more information, see Locators, Keys, and Indexing on page 191.
- on_fail: If the parser fails, writes an entry in the user log. For more information, see Failure Handling on page 231.
Online Samples
In the samples and tutorials folders, you can find many examples of parsers.
CHAPTER 4
Ports
This chapter includes the following topics:
Overview
A port is a component that specifies an input or output of a transformation, such as a source document or an output document. For example, in a Parser component, the values of the example_source and sources_to_extract properties are input ports. A port can define a document that is stored on the local computer, on a network, or in a string. Ordinarily, a port defines a default input or output, for example, a file that is used to develop and test the transformation. At runtime, the application that activates the transformation specifies the actual input or output document, overriding the defaults. For example, if you use the Data Transformation API to activate a parser, the API application specifies the input document that the parser should process, overriding the example_source. In many transformations, the ports are implicitly defined. For example, the default result file of a parser is a file called output.xml. You do not need to define an output port that references the file output.xml. By default, each transformation has a single input and a single output. Optionally, you can configure transformations that have multiple input and output ports.
- AdditionalInputPort: Defines an additional, non-default input of a transformation.
- AdditionalOutputPort: Defines an additional, non-default output of a transformation.
- DocList: Defines a list of documents.
- FileSearch: Defines a search criterion for a file.
- InputPort: Defines a document by referencing an additional input port.
- LocalFile: Defines a file path.
- Text: Defines a text string that is the input of a transformation.
- URL: Defines a URL where a document is located.
Example
Suppose you have two text files:
- IdsAndSalaries.txt
- IdsAndNames.txt
You want to parse these files jointly, generating an XML output file containing the employee names and salaries. You can configure the transformation in the following way:
- The main parser, called EmployeeParser, processes IdsAndSalaries.txt.
- The main parser activates a secondary parser, called IdsToNamesParser, which processes IdsAndNames.txt and stores the result in an XML table.
- The main parser uses a LookupTransformer to convert the IDs to names. The lookup table is the output of the secondary parser.
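The data flow of this configuration can be sketched in Python. The sketch is not IntelliScript and does not use the Data Transformation API. The file contents, the tab-delimited layout, the second employee, the output element names, and the use of a dictionary in place of the XML lookup table are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents. The guide does not show the real layout.
IDS_AND_NAMES = "101\tRon Lehrer\n102\tDana Katz\n"
IDS_AND_SALARIES = "101\t52000\n102\t61000\n"


def ids_to_names_parser(text: str) -> dict[str, str]:
    """Stand-in for the secondary parser: builds the ID-to-name lookup table."""
    table = {}
    for line in text.splitlines():
        emp_id, name = line.split("\t")
        table[emp_id] = name
    return table


def employee_parser(salaries_text: str, lookup: dict[str, str]) -> ET.Element:
    """Stand-in for the main parser: converts IDs to names and emits XML."""
    root = ET.Element("Employees")  # hypothetical output structure
    for line in salaries_text.splitlines():
        emp_id, salary = line.split("\t")
        employee = ET.SubElement(root, "Employee")
        ET.SubElement(employee, "Name").text = lookup[emp_id]  # the lookup step
        ET.SubElement(employee, "Salary").text = salary
    return root


lookup_table = ids_to_names_parser(IDS_AND_NAMES)
employees = employee_parser(IDS_AND_SALARIES, lookup_table)
print(ET.tostring(employees, encoding="unicode"))
```

The secondary input is read once and reused for every record of the main input, which mirrors the roles of the secondary parser and the lookup in the configuration described above.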
The following IntelliScript illustrates this configuration. The secondary parser references an AdditionalInputPort that retrieves the file IdsAndNames.txt.
- A data holder where the system stores the content of the input when the transformation begins.
- input_encoding: Encoding of the input, such as a code page. For more information about encoding support, see Encoding Properties on page 218.
- example_source: The default location of the additional input. The value is an input-port component such as LocalFile or Text.

- pre_processor: The name of a preprocessor that the transformation should apply to the files. For more information, see Document Processors on page 23.
- disabled: If selected, the transformation ignores the component.
AdditionalOutputPort
This component defines an additional output port. The component enables you to configure a transformation that generates output in multiple, dynamically defined locations.
To define an additional output port:
1. At the global level of the IntelliScript, insert an AdditionalOutputPort component, and assign it a name.
2. Nested within the transformation, insert a WriteValue action, and configure it with an OutputPort that references the name.
3. Disable the default output to the result file:
   - On the menu, click Project > Properties.
   - Display the Output Control page.
   - Select the option to Disable Automatic Output.
If you want to obtain output in the default result file, insert an additional WriteValue action configured with the ResultFile option.
When you run the transformation in the Studio, the system defines a file name for the additional output, and it stores the file in the Results folder of the project. For example, if the port is called MyOutputPort, the file name might be output_MyOutputPort.xml.
To ascertain the file name of the additional output:
1. Click Run > Run.
2. Click Details to display the I/O Ports table. The table displays the name of each AdditionalOutputPort and its output file.
When you deploy the transformation as a Data Transformation service, an application that runs the service can pass the additional output location as a parameter. For example, the location might be a buffer. For more information, see the Data Transformation Engine Developer Guide and the API references.
Example
A parser generates the following XML structure:
<Person gender="M">
  <Name>
    <First>Ron</First>
    <Last>Lehrer</Last>
  </Name>
  <Id>547329876</Id>
  <Age>27</Age>
</Person>
The first WriteValue writes the entire Person element to the default results file.
The second WriteValue references an AdditionalOutputPort to write the nested Name element to another file.
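The effect of the two WriteValue actions can be imitated in Python for illustration. This is a sketch of the resulting outputs, not of the IntelliScript configuration. The dictionary of outputs and the port name MyOutputPort (borrowed from the file-name example earlier in this section) are assumptions.

```python
import xml.etree.ElementTree as ET

PERSON_XML = """<Person gender="M">
  <Name><First>Ron</First><Last>Lehrer</Last></Name>
  <Id>547329876</Id>
  <Age>27</Age>
</Person>"""

person = ET.fromstring(PERSON_XML)

# Each key stands for one output location. A dictionary is used here instead
# of real files so that the sketch has no side effects.
outputs = {}

# First write: the entire Person element goes to the default output.
outputs["default"] = ET.tostring(person, encoding="unicode")

# Second write: only the nested Name element goes to the additional output.
# strip() removes the whitespace that followed the element in the source.
outputs["MyOutputPort"] = ET.tostring(person.find("Name"), encoding="unicode").strip()

for port, content in outputs.items():
    print(port, "->", content)
```

In a deployed service, the application that runs the transformation would decide where each of these outputs is actually stored.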
- The encoding of the additional output, such as a code page. For more information about encoding support, see Encoding Properties on page 218.

- disabled: If selected, the transformation ignores the component.
- The file extension for the output file. The default is .xml.
- Enables you to configure properties of the output such as an XML header. For more information about the output properties, see Output Control Properties on page 222 and XML Generation Properties on page 223.
DocList
A document list. The component allows you to specify multiple source documents that a transformation should process. Within the component, you can nest multiple input ports such as FileSearch, LocalFile, or Text, each of which specifies a single document.
FileSearch
The criteria for a file search. You can use this input port, for example, in the sources_to_extract property of a Parser. It lets you specify source documents using wildcards.
Table 4-5. Basic Properties
- directory: The folder to be searched.
- wildcard: The search criterion. You may use * as a wildcard character. For example, *.txt finds all text files. The default is *.*, which finds all files in the directory.

- If selected, the search includes subfolders of the specified directory.
- pre_processor: The name of a preprocessor that the transformation should apply to the files. For more information, see Document Processors on page 23.
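The search behavior described by these properties resembles an ordinary wildcard file search. The following Python sketch illustrates it under stated assumptions: the function name file_search and the recursive parameter are hypothetical, and the mapping of *.* to "all files" is added because Python's own globbing would otherwise skip file names without a dot.

```python
import tempfile
from pathlib import Path


def file_search(directory, wildcard="*.*", recursive=False):
    """Return the files in a directory that match a wildcard pattern."""
    # Assumption: *.* is treated as "all files", as the property table states.
    pattern = "*" if wildcard == "*.*" else wildcard
    base = Path(directory)
    matches = base.rglob(pattern) if recursive else base.glob(pattern)
    return sorted(p for p in matches if p.is_file())


# Build a small throwaway folder tree to search.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("one")
    (root / "b.csv").write_text("two")
    (root / "sub").mkdir()
    (root / "sub" / "c.txt").write_text("three")

    top_level = [p.name for p in file_search(root, "*.txt")]
    with_subfolders = [p.name for p in file_search(root, "*.txt", recursive=True)]
    everything = [p.name for p in file_search(root)]

print(top_level)
print(with_subfolders)
print(everything)
```

Each path that the search returns corresponds to one source document that the transformation would process.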
InputPort
This component specifies that the input should be taken from a named port that is defined by using an AdditionalInputPort component.
Table 4-7. Basic Properties
- input: The name of the AdditionalInputPort component defining the input.
LocalFile
A file on the local computer.
Table 4-8. Basic Properties
Property
file_name
- A URL that Data Transformation should assign to the file. This property instructs Data Transformation to treat the file as if it were located on a web server. If the file contains relative links, Data Transformation resolves the links relative to the URL. The host-name portion of the URL is not case sensitive. Internally, Data Transformation processes HTTP host names as lower case.
- pre_processor: The name of a preprocessor that the transformation should apply to the files. For more information, see Document Processors on page 23.
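Resolving relative links against an assigned URL, and treating the host name as lower case, follows standard URL rules. The following Python sketch shows those rules with the standard library. It is not Data Transformation code, and the example URL and link values are hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_host(url: str) -> str:
    """Lower-case the host portion of a URL and leave the path unchanged."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


# Hypothetical URL assigned to a local file.
assigned_url = "http://WWW.Example.com/reports/index.html"
base = normalize_host(assigned_url)

# Relative links found in the file are resolved against the assigned URL.
image_link = urljoin(base, "images/logo.gif")
parent_link = urljoin(base, "../style.css")
root_link = urljoin(base, "/top.html")

print(base)
print(image_link)
print(parent_link)
print(root_link)
```

Only the host name is lower-cased. The path portion of a URL can be case sensitive on the server, so the sketch leaves it untouched.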
OutputPort
This component specifies a named port that is defined by using an AdditionalOutputPort component. You can use an OutputPort component in a WriteValue action.
Table 4-10. Basic Properties
Property
port
Text
A text string.
Table 4-11. Basic Properties
Property
quote
Description A URL that Data Transformation should assign to the string. This property instructs Data Transformation to treat the string as if it were a file located on a web server. If the string contains relative links, Data Transformation resolves the links relative to the URL.
- pre_processor: The name of a preprocessor that the transformation should apply to the files. For more information, see Document Processors on page 23.
- size: A static size for the text buffer. This property is typically used when working with binary sources. The default is -1, which means that the buffer is dynamically sized.
URL
Note: This component is provided for compatibility with projects created in earlier Data Transformation versions. It is being phased out of the Data Transformation system. Do not use it in new projects.
The URL of a document that is available on a web server.
Table 4-13. Basic Properties
Property
stable_url
Description The URL address, for example, http://www.example.com/index.html. The host name, www.example.com, is not case sensitive. Internally, Data Transformation processes HTTP host names as lower case.
Description Data that the transformation should post to the URL. To determine the correct format of the data string, you can use the technique described in the SubmitForm action. If the transformation cannot access the URL on the first attempt, the number of retries that it performs before reporting a failure. Default = 0. The number of seconds to wait between retries. Default = 60. The name of a preprocessor that the transformation should apply to the files. For more information, see Document Processors on page 23.
retries
seconds_to_wait pre_processor
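The interaction of the retries and seconds_to_wait properties can be sketched as a simple retry loop. The following Java code is a conceptual illustration, not the Data Transformation implementation. The class, the method, and the Supplier that stands in for the URL access are hypothetical, and the sketch assumes that a failed attempt surfaces as a runtime exception.

```java
import java.util.function.Supplier;

class RetrySketch {
    // Try once, then retry up to 'retries' more times, waiting between attempts.
    static <T> T fetchWithRetries(Supplier<T> attempt, int retries, long secondsToWait) {
        if (retries < 0) {
            throw new IllegalArgumentException("retries must be >= 0");
        }
        RuntimeException last = null;
        for (int i = 0; i <= retries; i++) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and possibly retry
            }
            if (i < retries) {
                try {
                    Thread.sleep(secondsToWait * 1000L);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", ie);
                }
            }
        }
        throw last; // every attempt failed: report the failure
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String body = fetchWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("URL not reachable");
            }
            return "document body";
        }, 2, 0);
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```

With retries set to 0, the default, the loop makes a single attempt and reports the failure immediately.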
CHAPTER 5
Document Processors
This chapter includes the following topics:
Overview, 23
Defining Document Processors, 23
Document Processor Quick Reference, 25
Document Processor Component Reference, 25
TextML XML Schema, 36
PdfToTxt_4 Table Configuration Editor, 36
Overview
Document processors are components that convert the format of a complete document to another format that is desired for processing. You can use a document processor as a pre-processor that converts the format of a source document prior to a transformation. For example, if the source document of a parser is in the PDF format, you might apply the PdfToTxt_4 processor. This converts the source document to text, which is much easier to parse than the binary PDF format. Do not confuse document processors with format preprocessors. For more information about format preprocessors, see Formats on page 43.
Installation
The document processors are supplied in an optional setup component. If you plan to use the processors, select the option to install the Processors when you run the Data Transformation setup.
1. Assign the example_source property of the transformation. The value of the example_source is an input port, such as LocalFile or Text. For more information, see Ports on page 15.
2. Assign the pre_processor property of the input port. Data Transformation applies the processor that you define under example_source to all sources on which you run the transformation.
3. If you use the processor in a serializer or mapper project, add the XSD schema of the processor output to the project. In a parser project, you do not need to add the schema.
Note: You can also define a pre-processor in the sources_to_extract property of a parser. The processor that you define there applies only to the source documents that you define in sources_to_extract, and not to any other document that the parser processes.
Document Processor Quick Reference
Processor                   Description
AFPToXML                    Converts the IBM Advanced Function Presentation print-stream format to XML.
ExcelToDataXml              Converts Microsoft Excel data to XML, without preserving Excel formulas, formatting, or code.
ExcelToHtml                 Converts Microsoft Excel documents to HTML.
ExcelToTextML               Converts Microsoft Excel files to the TextML XML schema.
ExcelToTxt                  Converts Microsoft Excel documents to plain text.
ExcelToXml                  Converts Microsoft Excel documents to XML, while preserving the Excel formulas, formatting, and optionally macro code.
ExpandFrameSet              Opens an HTML frameset, letting a parser run on the content of the frames.
ExternalCOMPreProcessor     Runs a custom document processor, implemented as a COM DLL.
ExternalJavaPreProcessor    Runs a custom document processor, implemented in Java.
ExternalPreProcessor        Runs a custom document processor, implemented as a C++ DLL.
PdfFormToXml_1_00           Converts PDF forms to XML.
PdfToTxt_4                  Converts PDF documents to plain text.
PowerpointToHtml            Converts Microsoft PowerPoint documents to HTML.
PowerpointToTextML          Converts Microsoft PowerPoint presentations to the TextML XML schema.
ProcessByTransformers       Runs transformers as document processors.
ProcessorPipeline           Runs a sequence of document processors on a single document.
RtfToTextML                 Converts RTF files to the TextML XML schema.
WordPerfectToTextML         Converts Corel WordPerfect documents to the TextML XML schema.
WordToHtml                  Converts Microsoft Word documents to HTML.
WordToRtf                   Converts Microsoft Word documents to RTF.
WordToTextML                Converts Microsoft Word files to the TextML XML schema.
WordToTxt                   Converts Microsoft Word documents to plain text.
WordToXml                   Converts Microsoft Word documents to XML.
XmlToDocument               Converts XML to document formats such as PDF, Word, Excel, PowerPoint, PostScript, and HTML.
XmlToExcel                  Converts XML documents to Microsoft Excel.
AFPToXML
This document processor converts the IBM Advanced Function Presentation print-stream format to XML. The processor output is in the UTF-8 encoding. If a transformation receives input from the processor, you must set the input encoding of the project to UTF-8. For more information, see Encoding Properties on page 218.
ExcelToDataXml
This document processor converts Microsoft Excel documents to XML. The XML contains the data and the results of formulas that existed in the original Excel document. It does not preserve the formulas themselves, formatting information, or macro code. In cases where the latter information is required, use ExcelToXml rather than ExcelToDataXml. The XML representation conforms to a subset of the ExcelToXml.xsd schema, which you can find in the doc subdirectory of the Data Transformation installation directory. The processor output is in the UTF-8 encoding. If a transformation receives input from the processor, you must set the input encoding of the project to UTF-8. For more information, see Encoding Properties on page 218. The processor supports Excel version 97 and later. It accesses its input directly, not via Excel. You do not need to install Excel on the computer. The processor supports both the XLS format and the XLSX format introduced in Excel 2007. The processor is implemented in Java. If you experience any difficulty using the processor, confirm that you have configured the Java Runtime Environment (JRE) correctly. For more information about the JRE, see the Data Transformation Administrator Guide.
Table 5-1. Basic Properties
Property                           Description
Display_raw_data_when_different    Formatted data in the source file may appear differently from the raw data. For example, the raw data 1 appears as 1.00 if its cell is formatted as Number with two decimal places. If you enable this property, the processor includes both the raw data and the formatted data in its output, if they differ. If you disable the property, the processor includes only the formatted data.
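The difference between raw and formatted data can be illustrated with a short Java sketch. The code below is a conceptual illustration and does not reproduce the processor or its XML output. The class name, method names, and the output strings are hypothetical.

```java
import java.util.Locale;

class RawVersusFormatted {
    // Format a raw numeric cell value as "Number with two decimal places".
    static String formatTwoDecimals(double raw) {
        return String.format(Locale.ROOT, "%.2f", raw);
    }

    // Mimic the property: report the raw value only when it differs from the formatted value.
    static String describeCell(String rawText, String formattedText, boolean displayRawWhenDifferent) {
        if (displayRawWhenDifferent && !rawText.equals(formattedText)) {
            return "formatted=" + formattedText + ", raw=" + rawText;
        }
        return "formatted=" + formattedText;
    }

    public static void main(String[] args) {
        String formatted = formatTwoDecimals(1);
        System.out.println(describeCell("1", formatted, true));
        System.out.println(describeCell("1", formatted, false));
    }
}
```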
ExcelToHtml
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects.
This document processor converts Microsoft Excel documents to HTML. The processor uses the Excel save-as-HTML feature to perform the conversion. It operates only on a Microsoft Windows platform where Excel version 97 or higher is installed. Due to Excel limitations, the processor does not support multithreading.
ExcelToTextML
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects.
This document processor converts Microsoft Excel files to the TextML XML schema. For more information, see TextML XML Schema on page 36. The processor supports Excel version 97 and higher. It accesses its input directly, not via Excel. You do not need to install Excel on the computer. The processor is implemented in Java. If you experience any difficulty using the processor, confirm that you have configured the Java Runtime Environment (JRE) correctly. For more information about the JRE, see the Data Transformation Administrator Guide.
ExcelToTxt
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects.
This document processor converts Microsoft Excel documents to plain text. The processor uses the Excel save-as-text feature to perform the conversion. It operates only on a Microsoft Windows platform where Excel version 97 or higher is installed. Due to Excel limitations, the processor does not support multithreading.
ExcelToXml
This document processor converts Microsoft Excel documents to XML. The XML preserves the data, formulas, formatting, and optionally the macro code that existed in the original Excel document. If only the data is required, consider using the ExcelToDataXml processor, which offers smaller output and better performance. The XML representation conforms to the ExcelToXml.xsd schema, which is in the doc subdirectory of the Data Transformation installation directory. The processor output is in the UTF-8 encoding. If a transformation receives input from the processor, you must set the input encoding of the project to UTF-8. For more information, see Encoding Properties on page 218. The processor supports Excel version 97 and later. It accesses its input directly, not via Excel. You do not need to install Excel on the computer. The processor is implemented in Java. If you experience any difficulty using the processor, confirm that you have configured the Java Runtime Environment (JRE) correctly. For more information about the JRE, see the Data Transformation Administrator Guide.
Table 5-2. Basic Properties
Property                     Description
include_sheets               Defines the sheets of the Excel workbook to include in the XML. In the XML output, each sheet is represented by a <sheet> element. In the list under this property, you may enter any of the following values:
                             - All: includes all sheets
                             - The sheet names
                             - Data holders containing the sheet names
                             If you list a sheet that doesn't exist in the workbook, the processor generates a <sheet> element containing a warning message. The other sheets are processed normally.
include_empty_cells          Deselect this property to omit empty cells from the XML.
include_macro_information    Select this property to include Excel macro code in the XML.
ExpandFrameSet
This document processor opens all the frames of an HTML document. This processor is appropriate if the source document of a parser is an HTML frameset. The parser runs on the content of all the frames.
ExternalCOMPreProcessor
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects. For more information about custom processors, see the Data Transformation Engine Developer Guide.
This component allows you to run a custom document processor. Because the component uses the Microsoft COM architecture to activate the processor, it runs only on Microsoft Windows platforms.
To create a custom COM processor: 1.
The in_file parameter is the content of the source document. The function returns the processed text.
2. Register the DLL on the Data Transformation computer.
3. Define an ExternalCOMPreProcessor that references the ProgID of the DLL.
4. Optionally, add the ExternalCOMPreProcessor to the component list that Data Transformation Studio displays.
ExternalJavaPreProcessor
This component allows you to run a custom document processor that is implemented in Java.
Note: This component is supported for backwards compatibility with existing custom processors. For more information about custom processors and other external components, see the Data Transformation Engine Developer Guide.
To create a custom Java processor:
1. Create a new Java project and package, for example, named MyJavaPreprocessor.
2. Create a class, for example, named JavaDemoPreprocessor.
3. In the class, define a method having the following syntax. The method can have any name.
public static String main(String input_file, String output_file)
The input_file parameter is the path of the source document on which the processor should operate. The output_file parameter is the path of a temporary file where the processor should write its output. The function returns an extension that Data Transformation appends to the name of the temporary file. For example, if the output of the processor is XML, the function can return the string "xml".
4. Create a jar file containing the class.
5. Store the jar file in the externLibs\user subfolder of your Data Transformation installation folder.
6. Define an ExternalJavaPreProcessor that references the class and method.
7. Optionally, add the ExternalJavaPreProcessor to the component list that Data Transformation Studio displays.
If you experience any difficulty using the processor, confirm that you have configured the Java Runtime Environment (JRE) correctly. For more information about the JRE, see the Data Transformation Administrator Guide.
Example
The following is the source code of a processor that repairs numeric values by removing commas between the numbers.
package MyJavaPreprocessor;

import java.io.*;

public class JavaDemoPreprocessor {
    private static final int MAX_SIZE = 4096;

    public static String main(String input_file, String output_file) {
        // try-with-resources closes both streams once, after the whole file is processed
        try (InputStream in = new BufferedInputStream(new FileInputStream(input_file), MAX_SIZE);
             OutputStream out = new BufferedOutputStream(new FileOutputStream(output_file), MAX_SIZE)) {
            int prev = -1;          // byte before the current one, or -1 at the start of the file
            int cur = in.read();
            while (cur != -1) {
                int next = in.read();   // -1 at the end of the file
                // Drop a comma only when it sits between two digits
                boolean commaBetweenDigits = cur == ','
                        && prev != -1 && Character.isDigit((char) prev)
                        && next != -1 && Character.isDigit((char) next);
                if (!commaBetweenDigits) {
                    out.write(cur);
                }
                prev = cur;
                cur = next;
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException
            e.printStackTrace();
        }
        // Return the output file extension type
        return "txt";
    }
}
Description The path of the Java class, for example, MyJavaPreprocessor/JavaDemoPreprocessor. The method to run, for example, main.
Online Sample
For an online sample of the Java code, similar to the above example, see the following file in the Data Transformation installation folder:
samples\SDK\ExternalPreprocessor\External_JavaPreprocessor.java
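Before you package a processor in a jar file, it can be convenient to call its method directly from a small test harness. The following Java sketch is one way to do this and is not part of the Data Transformation SDK. The class and method names are hypothetical, and the sample processor uses a regular expression, a different technique from the example above, to remove commas between digits. It follows the contract described earlier: it reads the input path, writes the output path, and returns a file extension.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class PreprocessorHarness {
    // Hypothetical processor: removes commas that sit between two digits.
    static String stripCommas(String input_file, String output_file) {
        try {
            String text = new String(Files.readAllBytes(Paths.get(input_file)), StandardCharsets.UTF_8);
            String repaired = text.replaceAll("(?<=\\d),(?=\\d)", "");
            Files.write(Paths.get(output_file), repaired.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return "txt"; // extension of the output
    }

    // Write the sample to a temporary file, run the processor, and return "extension:output".
    static String runOnSample(String sample) {
        try {
            Path in = Files.createTempFile("processor_in", ".tmp");
            Path out = Files.createTempFile("processor_out", ".tmp");
            try {
                Files.write(in, sample.getBytes(StandardCharsets.UTF_8));
                String extension = stripCommas(in.toString(), out.toString());
                return extension + ":" + new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(in);
                Files.deleteIfExists(out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnSample("total 1,234,567 units, ok"));
    }
}
```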
ExternalPreProcessor
This component allows you to run a custom document processor that is implemented as a C++ DLL.
Note: This component is supported for backwards compatibility with existing custom processors. For more information about custom processors and other external components, see the Data Transformation Engine Developer Guide.
The following instructions are for the Microsoft Visual C++ compiler, running on a Microsoft Windows platform.
1. Copy the following file from the Data Transformation installation folder:
samples\SDK\ExternalPreprocessor\External_Preprocessor.cpp
2. Using the Visual C++ compiler, create a Win32 dynamic-link library project, and insert the C++ file into the project.
3. Edit the following function:
__declspec(dllexport) bool process_buffer(istream& in, ostream& out)
In the sample implementation, the function repairs numeric values, removing commas between the values. Replace the sample code with your implementation.
4. Compile the DLL.
5. Store the DLL in the externLibs\user subfolder of the Data Transformation installation folder.
6. Define an ExternalPreProcessor that references the DLL.
7. Optionally, add the ExternalPreProcessor to the component list that Data Transformation Studio displays. For more information about customizing the component list, see Using Data Transformation Studio in Eclipse.
PdfFormToXml_1_00
This document processor converts PDF forms to XML. The processor supports forms that conform to the Adobe AcroForms standard, for example, forms that were created by Foxit or Open Office.
1. On the Studio menu, click File > New > Project.
2. In the Data Transformation category, select an Import Project and click Next.
3. Enter a name for the project.
4. As the import type, select PdfForm.
5. Browse to a sample PDF form file of the kind that you plan to process.
6. Verify the path of the PDF form file and click Finish. The Studio creates a mapper project having the following characteristics:
   - The project is configured with a PdfFormToXml_1_00 processor. The output of the processor is the input of the mapper.
   - The example source of the mapper is the sample PDF form that you selected.
   - The project contains a schema for the XML output of the processor.
7. If you plan to map the form data to another XML schema, configure the mapper.
8.
Edit the IntelliScript of the mapper project, changing the startup component from a mapper to a serializer using the same example source. Create an independent serializer project using the same example source. Add the schema file that you generated in the mapper project to the serializer project.
You can then use the PdfFormToXml_1_00 processor with the serializer.
Tip: To verify that PDF source documents conform to the schema, configure the mapper or serializer with the
PdfToTxt_4
This document processor converts PDF files to text or XML. The processor output is in the UTF-8 encoding. Since the parser receives input from the processor, you must set the input encoding of the project to UTF-8. For more information, see Encoding Properties on page 218. The processor provides a graphical table configuration editor. You can use the editor to improve the processing by defining table locations and column widths. For more information, see PdfToTxt_4 Table Configuration Editor on page 36.
PdfToTxt_4 does not require Adobe Acrobat or other PDF software to convert PDF files.
By default, the processor generates text output. Optionally, if you use the graphical table editor, you can select XML output. The XML conforms to the PDF4.xsd schema, stored in the doc subdirectory of the Data Transformation installation directory.
Table 5-6. Basic Properties
Property
PdfLayout
Description Defines the PDF table layout. Double-click the value of this property to open the table configuration editor.
Note: If you open a project that was created in a previous Data Transformation version, you might observe that it uses an older PdfToTxt processor such as PdfToTxt_3_02. The older versions are supplied for backwards compatibility. In new projects, use PdfToTxt_4.
PowerpointToHtml
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects.
This document processor converts Microsoft PowerPoint documents to HTML. The processor uses the PowerPoint save-as-HTML feature to perform the conversion. It operates only on a Microsoft Windows platform where PowerPoint version 97 or higher is installed. Due to PowerPoint limitations, the processor does not support multithreading.
PowerpointToTextML
This document processor converts Microsoft PowerPoint (*.PPT) presentations to the TextML XML schema. For more information, see TextML XML Schema on page 36. The processor supports PowerPoint version 97 and higher. It accesses its input directly, not via PowerPoint. You do not need to install PowerPoint on the computer. The processor is implemented in Java. If you experience any difficulty using the processor, confirm that you have configured the Java Runtime Environment (JRE) correctly. For more information about the JRE, see the Data Transformation Administrator Guide.
ProcessByTransformers
This component allows you to run transformers as document processors. The component runs a transformer or a sequence of transformers on the entire document, as opposed to the normal transformer usage, which is to run on the text retrieved by an anchor. A transformation can then run on the output of the transformers. Data Transformation offers a large number of transformers. Hence, the ProcessByTransformers component greatly expands the set of processing operations that you can apply to a document. For more information, see Transformers on page 105.
Table 5-7. Basic Properties
Property
transformers
ProcessorPipeline
This component allows you to run a sequence of document processors on a document. A transformation can run on the output of the sequence. Within this component, enter the sequence of processors.
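Conceptually, a processor pipeline passes the output of each processor to the next processor in the sequence. The following Java sketch models that idea with plain functions on strings. It is an analogy only. The class, the method, and the sample stages are hypothetical and are not Data Transformation components.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class PipelineSketch {
    // Apply each stage in order; the output of one stage is the input of the next.
    static String runPipeline(String document, List<UnaryOperator<String>> processors) {
        String current = document;
        for (UnaryOperator<String> processor : processors) {
            current = processor.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> processors = new ArrayList<>();
        processors.add(text -> text.replace("\r\n", "\n"));      // stage 1: normalize line endings
        processors.add(text -> text.trim());                     // stage 2: strip surrounding whitespace
        processors.add(text -> text.toUpperCase(Locale.ROOT));   // stage 3: change case
        System.out.println(runPipeline("  first line\r\nsecond line  ", processors));
    }
}
```

The order of the stages matters, because each stage sees only what the previous stage produced.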
RtfToTextML
This document processor converts RTF files to the TextML XML schema. For more information, see TextML XML Schema on page 36. The processor output is in the UTF-8 encoding. If a transformation receives input from the processor, you must set the input encoding of the project to UTF-8. For more information, see Encoding Properties on page 218.
WordPerfectToTextML
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects.
This document processor converts Corel WordPerfect documents to the TextML XML schema. For more information, see TextML XML Schema on page 36. The processor output is in the UTF-8 encoding. If a transformation receives input from the processor, you must set the input encoding of the project to UTF-8. For more information, see Encoding Properties on page 218. WordPerfect does not need to be installed on the computer.
WordToHtml
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects.
This document processor converts Microsoft Word documents to HTML. The processor uses the Word save-as-HTML feature to perform the conversion. It operates only on a Microsoft Windows platform where Word version 97 or higher is installed. Due to Word limitations, the processor does not support multithreading.
WordToRtf
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects.
This document processor converts Microsoft Word documents to RTF. The processor uses the Word save-as-RTF feature to perform the conversion. It operates only on a Microsoft Windows platform where Word version 97 or higher is installed. Due to Word limitations, the processor does not support multithreading.
WordToTextML
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects.
This document processor converts Microsoft Word files to the TextML XML schema. For more information, see TextML XML Schema on page 36. The processor supports Word version 97 and higher. It accesses its input directly, not via Word. You do not need to install Word on the computer. The processor is implemented in Java. If you experience any difficulty using the processor, confirm that you have configured the Java Runtime Environment (JRE) correctly. For more information about the JRE, see the Data Transformation Administrator Guide.
WordToTxt
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects.
This document processor converts Microsoft Word documents to plain text. The processor uses the Word save-as-text feature to perform the conversion. It operates only on a Microsoft Windows platform where Word version 97 or higher is installed. Due to Word limitations, the processor does not support multithreading. The output is encoded according to the system code page.
WordToXml
This document processor converts Microsoft Word documents to XML. The processor output is in the UTF-8 encoding. If a transformation receives input from the processor, you must set the input encoding of the project to UTF-8. For more information, see Encoding Properties on page 218. The processor supports Word version 97 and higher. It accesses its input directly, not via Microsoft Word. You do not need to install Word on the computer. The processor is implemented in Java. If you experience any difficulty using the processor, confirm that you have configured the Java Runtime Environment (JRE) correctly. For more information about the JRE, see the Data Transformation Administrator Guide.
Note: In Data Transformation 3.2, this processor generated XML in the ISO-8859-1 encoding. If you upgrade a Data Transformation 3.2 project that uses the processor, you might need to edit the input and working encodings. For more information, see Encoding Properties on page 218.
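The reason the encoding settings matter can be shown with a short Java sketch. The code below is independent of Data Transformation. It encodes a string in one encoding and decodes it in another, which is what happens when the input encoding of a project does not match the processor output. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchSketch {
    // Encode text with one charset and decode the bytes with another.
    static String decodeAs(String text, Charset written, Charset assumed) {
        return new String(text.getBytes(written), assumed);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café", written with an escape to avoid source-encoding issues
        // Matching encodings: the text survives unchanged.
        System.out.println(decodeAs(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // UTF-8 output read as ISO-8859-1: the accented character becomes two wrong characters.
        System.out.println(decodeAs(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```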
XmlToDocument
Note: This processor is not included in the Data Transformation setup. To obtain the XmlToDocument add-on package, contact Informatica. Install the add-on on all computers where you plan to run transformations using XmlToDocument.
This document processor converts XML data to document formats such as PDF, Word, Excel, PowerPoint, HTML, and PostScript. You can use it, for example, as a postprocessor to convert parser or mapper output to an easily readable document. The processor uses an Eclipse add-on called the Business Intelligence and Reporting Tools (BIRT) to generate the output documents. In BIRT, you must configure a report that converts the XML to the desired document format. The XmlToDocument processor runs the report. A copy of BIRT is included in the XmlToDocument installation package. For more information about BIRT, see http://www.eclipse.org/birt.
Table 5-8. Basic Properties
Property           Description
report_file        The BIRT *.rptdesign file. The Studio copies the file to the project folder. The Data Transformation Explorer displays the file under the Additional folder.
output_format      The format of the output document. Enter one of the following values:
                   - pdf. PDF document.
                   - doc. Microsoft Word document.
                   - xls. Microsoft Excel workbook.
                   - ppt. Microsoft PowerPoint presentation.
                   - html. HTML web page.
                   - ps. PostScript document.
report_location    The location of the *.rptdesign file, by default, the project directory. Optionally, you can change the location to another directory where you store *.rptdesign files.
1. In Data Transformation Studio, create a transformation and configure it to generate the XML.
2. Run the project, generating a sample XML output file. In the following steps, you will use the sample XML file to configure the BIRT report. You will also use the XSD schema that is defined in the project.
3. On the Studio menu, click Window > Open Perspective > Other > Report Design. This opens the BIRT perspective where you can design a report.
4. On the menu, click File > New > Report. This opens the BIRT New Report wizard.
5. At the prompt, store the report in the project directory, and specify a report name.
6. Select a report type such as a Simple Listing. The wizard displays an illustration of the report. Some report types include tables and charts in addition to text.
7. Click Finish to exit the Wizard and display the report editor.
8. Confirm that the Layout tab of the report editor is displayed.
9. Follow the instructions displayed in the Cheat Sheet view to configure the report features. See the BIRT documentation, available at http://www.eclipse.org/birt, for a full explanation of the procedures. In brief, you must configure:
   - Data source. Select the option for an XML Data Source. At the prompt, select the XML sample file and the XSD schema.
   - Data set. The set of XML elements and attributes that will be in the report output. At the prompts, set the table mapping to the \Person element, and set the output columns to First, Last, Id, and Age.
   - Data binding. The location of each element and attribute in the output. Drag the output columns to the first row of the layout.
10. To view the sample report, display the Preview tab of the report editor.
In this example, the report has a single line. If Person were a multiple-occurrence data holder, the report would have multiple lines. You can edit the configuration if required. BIRT saves the configuration as a *.rptdesign file.
11. Return to the Data Transformation Studio Authoring perspective.
12. At the location of the IntelliScript where you want to output a document, insert a WriteValue action.
13. Set the input property of WriteValue to the input element of the report.
14. Set the output property to the location of the document output.
15. In the transformers property of WriteValue, add a TransformByProcessor transformer. Configure TransformByProcessor to run an XmlToDocument processor, and configure its properties.
16. Run the project. In addition to the regular XML output, the transformation generates the PDF output. You can view the PDF in the Adobe Reader.
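The data set configured in the procedure above treats each Person element as one report row and the First, Last, Id, and Age values as its columns. The following Java sketch illustrates that row and column mapping with the standard JDK XML parser. It does not use BIRT or Data Transformation. The class name, the sample XML, and the assumption that the four columns are child elements of Person (rather than attributes) are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DataSetSketch {
    // Output columns, assumed here to be child elements of Person.
    static final String[] COLUMNS = {"First", "Last", "Id", "Age"};

    // Return one text row per Person element, with the columns separated by " | ".
    static List<String> rows(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList persons = doc.getElementsByTagName("Person");
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < persons.getLength(); i++) {
                Element person = (Element) persons.item(i);
                List<String> cells = new ArrayList<>();
                for (String column : COLUMNS) {
                    NodeList match = person.getElementsByTagName(column);
                    cells.add(match.getLength() > 0 ? match.item(0).getTextContent() : "");
                }
                rows.add(String.join(" | ", cells));
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("could not read the sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data with a single Person, so the result has a single row.
        String xml = "<Person><First>Jane</First><Last>Doe</Last><Id>1001</Id><Age>34</Age></Person>";
        for (String row : rows(xml)) {
            System.out.println(row);
        }
    }
}
```

If the XML contained several Person elements, the method would return several rows, which corresponds to a report with multiple lines.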
XmlToExcel
This document processor converts XML documents to Microsoft Excel format. The processor operates on an XML representation of an Excel workbook. The XML representation must be in the UTF-8 encoding and it must conform to the ExcelToXml.xsd schema. You can find the schema in the doc subdirectory of the Data Transformation installation directory. The schema file is provided for your information. You can use the processor without adding the schema to your project. The processor reverses the operation of ExcelToXml. For example, you can use ExcelToXml to convert an Excel workbook to XML. You can then alter some of the XML data and use XmlToExcel to convert the data back to an Excel workbook. The processor supports Excel version 97 and higher. It writes its output directly, not via Microsoft Excel. You do not need to install Excel on the computer. The processor is implemented in Java. If you experience any difficulty using the processor, confirm that you have configured the Java Runtime Environment (JRE) correctly. For more information about the JRE, see the Data Transformation Administrator Guide.
helps overcome these issues by providing a graphical table configuration editor. This section explains how to use the editor.
Note: If you use the processor with PDF documents that do not contain tables, or if the default table processing is sufficient, you do not need to use the graphical table editor. In that case, you can skip the instructions in this section.
To configure the table processing:
1. In the IntelliScript, configure an example_source that is a PDF file, and insert a PdfToTxt_4 processor.
2. Nested within the PdfToTxt_4 processor, double-click the value property. The Studio opens the PDF table configuration editor. The upper portion of the editor displays the input PDF document. The lower portion displays the PdfToTxt_4 output. Table editing commands are available on the toolbar at the top of the screen. You can right-click to display an editing menu.
3. Browse to a table in the PDF document and click Add Table. The system displays the name of the table in the Tables field and in the Name field.
4. Define the start of the table by entering a regular expression in the Table Start field. The expression must define the upper left corner of the table.
Tip: Use the headings of the first two columns as the regular expression. Add more column headings as needed to make Table Start unique. Separate the headings by a single space character, even if the columns are widely separated. You can use the ^ and $ symbols to force the regular expression to match the start and end of a line. For more information about regular expression syntax, see RegularExpression on page 125. To use regular expressions, you must select the Use Regular Expressions checkbox. For example, if the first two column headings are GMS and RMS ID, enter the value GMS RMS ID in the Table Start field. If the string GMS RMS ID might occur elsewhere in the document, enter ^GMS RMS ID.
5. Define the end of the table by entering a regular expression in the Table End field. The expression must define the text that immediately follows the table. For example, the first few words of body text immediately following the table might be a good definition for Table End. The value of Table End must appear in the body of the document, not in a page footer.
6.
Click Process. The editor displays the table configuration that PdfToTxt_4 detects. The top and bottom of the table appear as horizontal blue lines. The default column borders appear as vertical red lines.
7.
Edit the column borders by dragging them left or right as required. To add a column border, click Add Column, and position it in the table. To delete a column border, click Remove Column and then click the border to remove.
Note: If the table contains horizontally merged cells, PdfToTxt_4 might truncate the entries.
8. 9. 10.
Examine the output window to confirm that the table is converted properly. If not, correct the table definitions. Repeat these steps for each table in the PDF document. Click OK to return to the Studio. The value property of the PdfToTxt_4 processor now contains an XML string that defines the table configuration. The example pane displays the PdfToTxt_4 output. You can continue configuring the transformation in the IntelliScript.
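You can try out a Table Start expression before you enter it in the editor. The following is a minimal sketch in Python, assuming that the expression behaves like a conventional regular expression applied to each line of text. The sample text and the find_table_start helper are hypothetical, and the regular expression dialect of PdfToTxt_4 may differ in its details.

```python
import re

# Hypothetical text output of a PDF page. The column headings are
# separated by a single space, as the Tip recommends.
sample_text = "\n".join([
    "Summary: see GID RMS ID table below",
    "GID RMS ID Amount",
    "1001 77 250.00",
])


def find_table_start(text, expression):
    """Return the indexes of the lines that the expression matches."""
    pattern = re.compile(expression)
    return [i for i, line in enumerate(text.splitlines()) if pattern.search(line)]


# Without an anchor, the expression also matches the body text on line 0.
unanchored = find_table_start(sample_text, "GID RMS ID")

# The ^ symbol forces the match to begin at the start of a line.
anchored = find_table_start(sample_text, "^GID RMS ID")
```

In this sample, only the anchored expression identifies the heading row unambiguously.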
Editor Options
The following table describes the controls and fields in the PdfToTxt_4 table configuration editor.
Control or Field: Description
Zoom In: Make the PDF display larger.
Zoom Out: Make the PDF display smaller.
Fit Width: Display the PDF document according to the width of the window.
Prev Page: Go to the previous page.
Next Page: Go to the next page.
Find: Search for a string in the PDF.
Add Table: Add a table to the configuration.
Rem. Table: Remove a table from the configuration.
Add Column: Add a column border to the current table.
Rem. Column: Delete the currently selected column border.
Process: Apply the current table definitions. Click Process after every table and column-related action to apply that action.
Tables: A list of tables defined in the input PDF. You can select a table by clicking it.
Name: Name of the currently selected table.
Table Start: An expression defining the upper left corner of the table.
Table End: An expression defining the first text after the table.
Page Header: An expression defining the end of the page header. Use this option to exclude the header from the table processing.
Page Footer: An expression defining the end of the page footer. Use this option to exclude the footer from the table processing.
Use Regular Expressions: If selected, the processor interprets the Table Start, Table End, Page Header, and Page Footer as regular expressions and searches for matching text. If not selected, the processor interprets these fields as literal text.
Recalculate at Runtime: If you select this option, PdfToTxt_4 ignores the table configurations that you specified using the table configuration editor. This feature is useful if the tables in a PDF are simple enough for PdfToTxt_4 to process without special configuration. For example, suppose a simple PDF financial statement contains a table whose columns may vary slightly from month to month. Select the Recalculate at Runtime option to have PdfToTxt_4 adjust the column widths at runtime.
Recalculate Now: If you have changed the table definition, for example by changing column borders or adding a Page Header or Page Footer, click Recalculate Now to update the table definition.
(page indicator): Number of the PDF page that is currently displayed.
Output as XML: Generates the PdfToTxt_4 output as XML instead of text.
(column separator field): Enter a character to use as the column separator in the text output. The default is a vertical bar (|).
OK: Click to save the table configuration and return to the Studio.
(cancel button): Click to return to the Studio project without saving the table configuration.
(table navigation aid): Displays the number of times a table is found in the PDF document. An example of a navigation aid is Table Table 1 found 2 times. The arrows next to this information let you jump back and forth among the instances of the same table structure.
Set Table Start = GID RMS ID, the headings of the first two columns of the table. Note that the expression is case sensitive.
Set Table End = Forward exchange transactions, the first text following the table.
The editor displays the table configuration:
If necessary, adjust the table definition and the columns. You can drag, add, or remove column borders. The following figure shows the text output.
Click Add Table. The system displays Table 2 in the Tables and Name fields.
Set Table Start = Ticker Shares Traded
Set Table End = Conclusion, the first body text after the table.
Click Process to configure the table.
Adjust the right borders of the Shares Traded and Currency columns.
A fragment of the formatted table is shown below. Notice how the page header and footer appear on each page of the formatted document, breaking the table into sections.
We can eliminate the page header and footer from the output document as follows:
Set Page Header = Gain/Loss
Set Page Footer = Page [1-9]
Click Process
These options remove the page header and footer from the formatted table:
To view the converted document, right-click the parser and click Open Example Source. The example source document appears in text format.
Alternatively, return to the table configuration editor and select the Output as XML option. The processor output is now displayed as XML.
CHAPTER 6
Formats
This chapter includes the following topics:
Defining Document Formats
Standard Properties of Formats
Format Component Reference
Delimiters Component Reference
Delimiter Subcomponent Reference
Format Preprocessor Component Reference
The format has properties of its own, which further define how the parser should interpret and process the input. Within a format, you can nest the following subcomponents:
Subcomponent: Description
Delimiters: A hierarchy of characters or strings that organize the information in the document, such as newlines and tabs.
Format preprocessors: Components that clean up the source before the parser starts searching for anchors.
Default transformers: Transformers that the parser applies to the output of each anchor.
By configuring the properties and subcomponents, you can support an extraordinarily broad range of source documents. This chapter describes the formats, delimiters, and format preprocessors that are available for your use. The subject of default transformers is discussed briefly here. For more information about default transformers, see Transformers on page 105.
Property: Description
delimiters: A hierarchy of characters or strings that organize the information in the document, such as newlines, spaces, tabs, commas, or vertical bars. You can also use a wildcard pattern to define the delimiters. The delimiter concept also encompasses positionally-structured data, where the fields are located at fixed offsets from one another. The value of the property is a delimiters component. For more information, see Delimiters Component Reference on page 47.
pre_processor: An optional format preprocessor that converts the source to a format that the parser can process. The format preprocessor acts on the source after any document processor that you defined. The purpose of the format preprocessor is to clean up whitespace or markup before the parser starts to search for anchors. For more information, see Format Preprocessor Component Reference on page 52.
default_transformers: A list of transformers that the parser applies in sequence to the output of each anchor. The purpose of the transformers is typically to clean up the output and remove markup codes. For more information, see Transformers on page 105.
Chapter 6: Formats
Description A name that you assign to the format. The name is displayed in the event log. A comment describing the format.
BinaryFormat
This format is suitable for parsing binary files. It is also suitable for text files that you want to treat as a buffer of binary bytes. By default, the delimiters property of this component has a value of Positional. The pre_processor and default_transformers properties are empty. For more information about the properties, see Standard Properties of Formats on page 44.
CustomFormat
This is a generic format, which you can use to process any type of source document. By default, the delimiters, pre_processor, and default_transformers properties of this component are empty. You must configure the properties yourself. For more information about the properties, see Standard Properties of Formats on page 44.
Example
A source document has the following structure:
Ron    Lehrer && 547329876:27
Evelyn   Kern && 9875424: 53
Each line of the document is a record containing a person's name, ID number, and age. The fields are separated by the symbols && and :. The fields contain multiple space characters at random locations. One way to parse this document is by using CustomFormat. In the delimiters property of the format, assign a DelimiterHierarchy containing the symbols:
newline
&&
:
In the default_transformers property, assign the HtmlProcessor, which removes the extra spaces from the output.
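The effect of this configuration can be approximated outside the Studio. The following is a minimal sketch in Python, not the Data Transformation implementation. It splits the example source by the three delimiter levels and then collapses the extra spaces. The normalize_space and parse_records helpers are hypothetical, and normalize_space also trims the ends of each field for readability.

```python
import re

# The example source, with extra spaces at arbitrary locations.
source = "Ron    Lehrer && 547329876:27\nEvelyn  Kern &&  9875424: 53"


def normalize_space(text):
    """Reduce each run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def parse_records(text):
    """Split the text by newline, then by &&, then by the colon."""
    records = []
    for line in text.split("\n"):               # highest level: newline
        name, rest = line.split("&&", 1)        # next level: &&
        person_id, age = rest.split(":", 1)     # lowest level: colon
        records.append(tuple(normalize_space(f) for f in (name, person_id, age)))
    return records


records = parse_records(source)
```

Each record becomes a (name, ID, age) tuple with the whitespace cleaned up, for example ("Ron Lehrer", "547329876", "27").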
HtmlFormat
This format is suitable for parsing HTML files. It is also suitable for processing Microsoft Office documents. For this purpose, assign a document processor such as WordToHtml or ExcelToHtml to convert the Office document to HTML. By default, the delimiters property of this component has a value of SGML. This causes the format to recognize the HTML delimiters such as < and >. The pre_processor is HtmlProcessor. The default_transformers are:
RemoveTags. Removes HTML tags from the output.
HtmlEntitiesToASCII. Converts HTML entities such as &lt; and &quot; to their plain text equivalents, < and ".
HtmlProcessor. Normalizes the whitespace, reducing any sequence of tabs, newlines, and spaces to a single space character.
RemoveMarginSpace. Removes leading and trailing space.
For more information about the properties, see Standard Properties of Formats on page 44.
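The combined effect of the HtmlFormat default transformers can be illustrated with ordinary string operations. The following Python sketch is an approximation under stated assumptions: the four functions are hypothetical stand-ins that imitate the described behavior with the standard re and html modules, and they are not the Data Transformation components themselves.

```python
import html
import re


def remove_tags(text):
    """Stand-in for RemoveTags: delete anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def entities_to_text(text):
    """Stand-in for HtmlEntitiesToASCII: decode entities such as &lt;."""
    return html.unescape(text)


def normalize_whitespace(text):
    """Stand-in for HtmlProcessor: reduce whitespace runs to one space."""
    return re.sub(r"[ \t\r\n]+", " ", text)


def remove_margin_space(text):
    """Stand-in for RemoveMarginSpace: trim leading and trailing space."""
    return text.strip()


def apply_default_transformers(anchor_output):
    """Apply the transformers in sequence to the output of one anchor."""
    for transform in (remove_tags, entities_to_text,
                      normalize_whitespace, remove_margin_space):
        anchor_output = transform(anchor_output)
    return anchor_output


result = apply_default_transformers("  <b>Price</b> &lt;\n\t 100 &quot;USD&quot; ")
```

The sequence matters in this sketch. Tags are removed before entities are decoded, so that a decoded < character is not mistaken for the start of a tag.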
RtfFormat
This format is suitable for parsing RTF files. By default, the delimiters property of this component has a value of RTF, which recognizes the standard RTF delimiter characters such as \. The pre_processor is RtfProcessor. The default_transformers are:
RtfToASCII. Removes RTF control words from the output.
RemoveRtfFormatting. Removes RTF formatting instructions from the text.
HtmlProcessor. Normalizes the whitespace, reducing any sequence of tabs, newlines, and spaces to a single space character.
RemoveMarginSpace. Removes leading and trailing space.
For more information about the properties, see Standard Properties of Formats on page 44.
TextFormat
This format is suitable for parsing text files. In combination with a document processor, this format is also suitable for processing other types of documents. For example, you can use the PdfToTxt_4, WordToTxt, or ExcelToTxt processor to process PDF, Microsoft Word, or Microsoft Excel documents with this format.
By default, the delimiters property of this component has a value of DelimiterHierarchy, which allows you to define your own set of delimiters. The pre_processor is empty. The default_transformers are:
HtmlProcessor. Normalizes the whitespace, reducing any sequence of tabs, newlines, and spaces to a single space character.
RemoveMarginSpace. Removes leading and trailing space.
For more information about the properties, see Standard Properties of Formats on page 44.
XmlFormat
This format is suitable for parsing XML files. Parsing an XML file means converting an XML source document to an XML output document. To do this, Data Transformation treats the source XML as ordinary text. You can define delimiters, anchors, and other components just as you do for a regular text document. This behavior is different from that of serializers or mappers that process XML documents. In serialization and mapping, Data Transformation uses the XSD schema and the formal XML syntax rules to interpret the source document. By default, the delimiters property of this component has a value of SGML. This causes the format to recognize the XML delimiters such as < and >. The pre_processor is HtmlProcessor. The default_transformers are:
RemoveTags. Removes XML tags from the output.
HtmlEntitiesToASCII. Converts XML entities such as &lt; and &gt; to their plain text equivalents, < and > respectively.
HtmlProcessor. Normalizes the whitespace, reducing any sequence of tabs, newlines, and spaces to a single space character.
RemoveMarginSpace. Removes leading and trailing space.
For more information about the properties, see Standard Properties of Formats on page 44.
You might define a Content anchor that is located two tab characters after the preceding Marker anchor in the example source, like this:
MARKER<tab>abc<tab>CONTENT
When Data Transformation processes a source document, it searches for the Content two tabs after the Marker. In a second example, you might define a Content anchor that is located three newlines and one tab after a Marker anchor, in the example source.
MARKER
abc<tab>de
fghi<tab>jkl<tab>mnop
pqrst<tab>CONTENT
Within the intermediate lines, the tabs are not counted because the newlines are higher in the hierarchy. Many of the delimiters components, such as TabDelimited or CommaDelimited, display a predefined hierarchy of delimiters, which you can edit as required.
The DelimiterHierarchy component does not have a predefined hierarchy. You can insert whatever delimiters you need. Some delimiter components, such as SGML or PostScript, have a built-in hierarchy that you cannot edit.
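The counting rule for a delimiter hierarchy can be sketched in a few lines of Python. This is an illustration of the idea under stated assumptions, not the parser's actual algorithm. The locate_content function and its steps parameter are hypothetical names. The function counts the delimiters of the highest level first, and then counts lower-level delimiters only inside the segment that it has reached.

```python
def locate_content(text, marker, steps):
    """Return the text that follows the marker after counting delimiters.

    steps lists (delimiter, count) pairs from the highest level of the
    hierarchy to the lowest, for example three newlines and then one tab.
    """
    pos = text.index(marker) + len(marker)
    limit = len(text)
    for delimiter, count in steps:
        for _ in range(count):
            found = text.find(delimiter, pos, limit)
            if found == -1:
                raise ValueError("delimiter not found the expected number of times")
            pos = found + len(delimiter)
        # Lower-level delimiters are counted only up to the next
        # delimiter of this level, that is, inside the current segment.
        next_same_level = text.find(delimiter, pos, limit)
        if next_same_level != -1:
            limit = next_same_level
    return text[pos:limit]


# First example: the Content anchor is two tabs after the Marker.
first = locate_content("MARKER\tabc\tCONTENT", "MARKER", [("\t", 2)])

# Second example: three newlines and then one tab. The tabs in the
# intermediate lines are skipped because newlines are counted first.
second = locate_content(
    "MARKER\nabc\tde\nfghi\tjkl\tmnop\npqrst\tCONTENT",
    "MARKER",
    [("\n", 3), ("\t", 1)],
)
```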
CommaDelimited
This delimiters component defines the following delimiter hierarchy:
Newline
Comma
CommaDelimited is suitable if each line of a text file contains a record, and each record contains data fields separated by commas.
You can add additional delimiters or edit the predefined hierarchy. The procedure is the same as for the DelimiterHierarchy component.
Example
In the example source document, suppose that a Content anchor follows a Marker anchor by two lines. In the third line, there are three commas, plus any other text, before the Content anchor, like this:
MARKER
abcdef, ghij
abc, def,ghi,CONTENT
If you assign the CommaDelimited component, the parser learns from the example source that the Content anchor always follows the Marker by two newlines and three commas. In another source document, the parser will successfully find the following Content anchor:
MARKER
xyz, uvw, rst
,,,CONTENT
DelimiterHierarchy
This delimiters component allows you to define a custom delimiter hierarchy. Under DelimiterHierarchy, you can nest any number of Delimiter or EnclosingDelimiters components.
Example
In the example source document, suppose that the anchors are separated by commas and surrounded by brackets, like this:
MARKER,,[CONTENT]
From this example, the parser learns that the Content anchor follows the Marker by two commas and is surrounded by brackets. In another source document, the parser will successfully find the following Content anchor:
MARKER,abc,def[CONTENT]
Online Sample
For an online sample, see samples\Projects\EDI\EDI.cmw. The sample uses a DelimiterHierarchy to define the newline and * characters as delimiters, in an EDI source document.
HL7
This delimiters component defines the hierarchy that is used for parsing HL7 messages:
newline
vertical bar (|)
caret (^) or tab
You can add additional delimiters or edit the predefined hierarchy. The procedure is the same as for the DelimiterHierarchy component. The HL7 messaging standard permits a message to define its own delimiters. You can parse the delimiter declaration of an HL7 message and create a dynamic delimiter definition in the following way:
1. Use Content anchors to retrieve the delimiter characters from the HL7 message header. Store the characters in variables.
2. Add Delimiter components under the HL7 component.
3. To each Delimiter component, assign TextSearch.
4. Under the TextSearch component, assign one of the variables to the text property.
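The idea behind the dynamic delimiter definition can be illustrated outside the Studio. The following Python sketch relies on the HL7 version 2 convention that the character immediately after MSH is the field separator, and that the next field lists the component, repetition, escape, and subcomponent characters. The read_hl7_delimiters function and the sample message are hypothetical. In a transformation, the Content anchors and variables described in the steps above take the place of this code.

```python
def read_hl7_delimiters(message):
    """Read the delimiter declaration from the MSH segment of a message."""
    if not message.startswith("MSH"):
        raise ValueError("the message is expected to start with an MSH segment")
    field_separator = message[3]
    # The encoding characters run up to the next field separator.
    encoding_chars = message[4:message.index(field_separator, 4)]
    names = ("component", "repetition", "escape", "subcomponent")
    delimiters = {"field": field_separator}
    delimiters.update(zip(names, encoding_chars))
    return delimiters


# Hypothetical message. Segments are separated by carriage returns.
message = (
    "MSH|^~\\&|LAB|HOSP|EHR|CLINIC|200811011200||ORU^R01|42|P|2.3\r"
    "PID|1||555^^^HOSP"
)

delims = read_hl7_delimiters(message)

# Use the retrieved characters, rather than fixed ones, to split the data.
pid_fields = message.split("\r")[1].split(delims["field"])
id_components = pid_fields[3].split(delims["component"])
```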
Positional
This delimiters component specifies that the parser should interpret the source document without using delimiters. Instead, it should locate each anchor by counting the characters from the beginning of the search scope. For more information about search scope, see Anchors on page 71.
Example
In the example source document, suppose that a Content anchor follows a Marker anchor by five characters, possibly including spaces, tabs, and so forth:
MARKERab cdCONTENTefg
If you assign the Positional component, the parser learns from the example source that the Content anchor always follows the Marker by five characters, and that it is seven characters long. In another source document, the parser will successfully find the following Content anchor:
MARKERd<tab>cbaCONTENTzy,xwv
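Positional extraction amounts to slicing at fixed offsets. The following minimal Python sketch applies the offset of five characters and the length of seven characters that the parser learns from the example source. The extract_positional function is a hypothetical name used only for illustration.

```python
def extract_positional(text, marker, offset, length):
    """Return length characters that begin offset characters after marker."""
    start = text.index(marker) + len(marker) + offset
    return text[start:start + length]


# The example source: five characters, including a space, then the content.
learned = extract_positional("MARKERab cdCONTENTefg", "MARKER", 5, 7)

# Another source document: the five characters include a tab.
found = extract_positional("MARKERd\tcbaCONTENTzy,xwv", "MARKER", 5, 7)
```

Because only the character count matters, the intervening characters can be letters, spaces, tabs, or punctuation.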
PostScript
This delimiters component defines a delimiter hierarchy that is used for parsing Adobe PostScript documents. You cannot edit the delimiter hierarchy of the PostScript component.
RTF
This delimiters component defines a delimiter hierarchy that is used for parsing RTF documents. You cannot edit the delimiter hierarchy of the RTF component.
SGML
This delimiters component defines a delimiter hierarchy that is used for parsing SGML documents. It is often used for parsing HTML and XML, which are derivatives of SGML. You cannot edit the delimiter hierarchy of the SGML component.
SpaceDelimited
This delimiters component defines the following delimiter hierarchy:
Newline
String of one or more space characters
SpaceDelimited is suitable if each line of a text file contains a record, and each record contains data fields separated by spaces.
You can add additional delimiters or edit the predefined hierarchy. The procedure is the same as for the DelimiterHierarchy component.
Example
In the example source document, suppose that a Content anchor follows a Marker anchor by two lines. In the third line, there are two space characters and one string containing multiple spaces before the Content anchor, like this:
MARKER
abcdef
abc def ghi    CONTENT
If you assign the SpaceDelimited component, the parser learns from the example source that the Content anchor always follows the Marker by two lines and three strings of spaces. In another source document, the parser will successfully find the following Content anchor:
MARKER
xyz
ghi def abc CONTENT
TabDelimited
This delimiters component defines the following delimiter hierarchy:
Newline
Tab
TabDelimited is suitable if each line of a text file contains a record, and each record contains data fields separated by tabs.
You can add additional delimiters or edit the predefined hierarchy. The procedure is the same as for the DelimiterHierarchy component.
Example
In the example source document, suppose that a Content anchor follows a Marker anchor by two lines. In the third line, there are three tab characters, plus any other text, before the Content anchor, like this:
MARKER
abcdef
abc<tab> de,f<tab>ghi<tab>CONTENT
If you assign the TabDelimited component, the parser learns from the example source that the Content anchor always follows the Marker by two lines and three tabs. In another source document, the parser will successfully find the following Content anchor:
MARKER
xyz
<tab><tab><tab>CONTENT
Delimiter
This subcomponent defines a delimiter character or string that separates anchors. You can add Delimiter subcomponents within a delimiter hierarchy.
Example
The TabDelimited component contains two Delimiter subcomponents. The first uses NewlineSearch to define the newline character as a delimiter. The second uses TextSearch to define the tab character as a delimiter. The tab is graphically represented as a character.
The SpaceDelimited component also contains two Delimiter subcomponents. The first is identical to that of TabDelimited. The second uses a PatternSearch to define any string of one or more spaces as a delimiter. The regular expression [ ]+ means one or more space characters.
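The behavior of a pattern that treats any run of spaces as a single delimiter can be checked with a short experiment. The following Python sketch uses the standard re module with the pattern [ ]+, that is, a space inside brackets followed by a plus sign. The fields_of_line helper and the sample texts are hypothetical, and the sketch only imitates the two-level SpaceDelimited hierarchy.

```python
import re

# One or more space characters, treated as a single delimiter.
SPACES = re.compile(r"[ ]+")


def fields_of_line(text, line_number):
    """Split by newline first, then split the chosen line by space runs."""
    line = text.split("\n")[line_number]
    return SPACES.split(line)


# Two single spaces and one longer run of spaces precede CONTENT.
example = "MARKER\nabcdef\nabc def ghi     CONTENT"
# A different document with the same structure but different spacing.
other = "MARKER\nxyz\nghi  def   abc CONTENT"

example_fields = fields_of_line(example, 2)
other_fields = fields_of_line(other, 2)
```

In both documents, the content is the field that follows the third string of spaces on the third line, regardless of how long each string of spaces is.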
Description
The delimiter definition. The value is one of the following searcher components. For more information, see Searcher Component Reference on page 98.
- NewlineSearch. The delimiter is a newline.
- PatternSearch. The delimiter is defined by a regular expression.
- TextSearch. The delimiter is an explicit string or a string that you retrieve dynamically from the source document.
EnclosingDelimiters
This subcomponent defines a pair of delimiter characters or strings, which surround anchors. You can add EnclosingDelimiters subcomponents within a delimiter hierarchy. For example, the component is useful to define the {} delimiters that surround blocks of C program code.
Table 6-4. Basic Properties
Property: Description
opening: The opening delimiter.
closing: The closing delimiter.
escape_sequence: A prefix in the source document, such as a backslash character \, which causes the parser to ignore an instance of the opening or closing delimiter.
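The interaction of the three properties can be sketched as a small search routine. The following Python code is an illustration under stated assumptions, not the parser's implementation. The function names are hypothetical, the sketch does not handle nested pairs, and it treats a delimiter as escaped only when the escape sequence appears immediately before it.

```python
def is_escaped(text, pos, escape):
    """Return True if the escape sequence appears immediately before pos."""
    return bool(escape) and pos >= len(escape) and text[pos - len(escape):pos] == escape


def find_unescaped(text, delimiter, start, escape):
    """Return the position of the next unescaped delimiter, or -1."""
    pos = text.find(delimiter, start)
    while pos != -1 and is_escaped(text, pos, escape):
        pos = text.find(delimiter, pos + 1)
    return pos


def find_enclosed(text, opening, closing, escape="\\"):
    """Return the text between the first unescaped opening delimiter and
    the next unescaped closing delimiter, or None if there is no pair."""
    open_pos = find_unescaped(text, opening, 0, escape)
    if open_pos == -1:
        return None
    start = open_pos + len(opening)
    close_pos = find_unescaped(text, closing, start, escape)
    if close_pos == -1:
        return None
    return text[start:close_pos]


# The escaped braces are ignored. Only the unescaped pair encloses content.
enclosed = find_enclosed("skip \\{ this { keep \\} all } tail", "{", "}")
```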
You can assign a document processor to the pre_processor property of an input port, located under the example_source or sources_to_extract property of a transformation. You can assign a format preprocessor only to the pre_processor property of a format. Data Transformation runs a document processor on the source document before it performs any other operations. The example pane of the IntelliScript editor displays the output of the document processor. Data Transformation runs a format preprocessor on the text that appears in the example pane, before it searches for anchors. The output of the format preprocessor is not displayed.
HtmlProcessor
This format preprocessor, which is also available as a transformer, normalizes whitespace according to HTML conventions. It reduces any sequence of tabs, line breaks, and space characters to a single space character. You can use this preprocessor to normalize whitespace in any type of text. It is not restricted to HTML documents. For more information, see Transformers on page 105.
RtfProcessor
This format preprocessor normalizes the code of RTF files. For more information, see Transformers on page 105.
CHAPTER 7
Data Holders
This chapter includes the following topics:
Overview
XSD Schemas
Adding XSD Schemas to a Project
Viewing a Schema
Using a Schema to Map Anchors
Generating Valid XML
Variables
Variable Component Reference
Multiple-Occurrence Data Holders
Overview
A data holder is an object that has one of the following types:
XML elements and attributes are typically used for permanent storage. A parser, for example, stores its output in data holders of these types. Variables are used for temporary storage. For example, a parser can store data that it extracts from a source document in a variable. It can process the data further before creating the output. A common feature of all data holders is that they have XSD data types. In the case of elements and attributes, the data holders are defined in an XSD schema that you must supply. Variables are defined in an internal schema, which you can customize by adding user-defined variables.
XSD Schemas
When you create a parser, serializer, or mapper, you must supply one or more XSD schemas that define the structure of the XML. The schema defines the elements and attributes that the transformation can use.
You must add the schema to your project. You can then map the content of a document to elements and attributes that are defined in the schema.
About XSD
XSD is the commonly-used name for XML Schema, which is the industry-standard language for XML schema definitions. XSD originally stood for XML schema description, but this term is not used in the official XML Schema standard. The schema files typically have *.xsd filenames. The XSD standard is maintained by the World Wide Web Consortium, http://www.w3.org. Since the standard was first published in 2001, XSD has rapidly replaced earlier schema languages such as DTD and XDR. The following is a simple example of an XSD schema:
<?xml version="1.0" encoding="Windows-1252"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="First" minOccurs="0" type="xs:string"/>
              <xs:element name="Last" minOccurs="0" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="Id" minOccurs="0" type="xs:string"/>
        <xs:element name="Age" minOccurs="0" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="gender" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
The schema defines the elements and attributes that can occur in an XML document. The syntax lets a schema author specify the hierarchy and sequence of elements, whether the elements are mandatory or optional, their data types, their possible values, and many other features. The above sample schema defines an XML structure such as the following:
<Person gender="M">
  <Name>
    <First>Ron</First>
    <Last>Lehrer</Last>
  </Name>
  <Id>547329876</Id>
  <Age>27</Age>
</Person>
If you trace through the schema, you can observe the correspondence between definitions such as
<xs:element name="Person">
or
<xs:attribute name="gender" type="xs:string"/>
and the elements and attributes of the XML. The elements and attributes have XSD data types, such as xs:string. An element that contains nested elements or attributes has a type of xs:complexType. The elements have many other properties, such as their required sequence and the minimum number of times that they must occur in an XML document, minOccurs.
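The correspondence between the schema and the XML can also be traced programmatically. The following Python sketch uses the standard xml.etree.ElementTree module to collect the names that the sample schema declares and to compare them with the names used in a document. It is only a name check under simplifying assumptions, not XSD validation, and the declared_names helper is hypothetical. The schema text omits the XML declaration for brevity.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

schema_text = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="First" minOccurs="0" type="xs:string"/>
              <xs:element name="Last" minOccurs="0" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="Id" minOccurs="0" type="xs:string"/>
        <xs:element name="Age" minOccurs="0" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="gender" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

document_text = (
    '<Person gender="M"><Name><First>Ron</First><Last>Lehrer</Last></Name>'
    "<Id>547329876</Id><Age>27</Age></Person>"
)


def declared_names(schema_root, kind):
    """Collect the name of every xs:element or xs:attribute declaration."""
    return {node.get("name") for node in schema_root.iter(XS + kind)}


schema_root = ET.fromstring(schema_text)
elements = declared_names(schema_root, "element")
attributes = declared_names(schema_root, "attribute")

# List any element or attribute in the document that the schema lacks.
document_root = ET.fromstring(document_text)
undeclared = [n.tag for n in document_root.iter() if n.tag not in elements]
undeclared += [a for n in document_root.iter() for a in n.attrib if a not in attributes]
```

XML names are case sensitive, so an element written as ID would not correspond to the Id declaration in this schema.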
For more information about XSD syntax, see the following websites:
URL: http://www.w3.org
Description: The web site of the World Wide Web Consortium, which created and maintains the XML Schema standard.
URL: http://www.w3schools.com
Description: See this site for an excellent tutorial introduction to XSD.
The schema encoding is identical to the working encoding, or every character in the schema has an equivalent in the working encoding. For example, if the schema uses the UTF-8 encoding, and the working encoding is Windows-1252, the schema must not contain Unicode characters that have no Windows-1252 equivalent.
When you add a schema from an external location to a project, Data Transformation translates the project copy of the schema to the working encoding.
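Before you add a schema, you can check whether its characters have equivalents in the working encoding. The following Python sketch shows one way to do so, assuming that the Python codec name cp1252 corresponds to the Windows-1252 working encoding. The unrepresentable_characters function is a hypothetical helper and is not part of Data Transformation.

```python
def unrepresentable_characters(schema_text, working_encoding):
    """Return the characters that have no equivalent in the encoding."""
    missing = set()
    for character in schema_text:
        try:
            character.encode(working_encoding)
        except UnicodeEncodeError:
            missing.add(character)
    return sorted(missing)


# The accented e exists in Windows-1252, so this text is acceptable.
acceptable = unrepresentable_characters('<xs:element name="Montréal"/>', "cp1252")

# The Greek capital omega has no Windows-1252 equivalent.
problematic = unrepresentable_characters('<xs:element name="Ω"/>', "cp1252")
```

An empty result means that every character in the text can be translated to the working encoding.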
Namespaces
If you plan to work with XML namespaces, assign the targetNamespace attribute of the schema. In Data Transformation Studio, you can edit the alias that is assigned to the namespace. For more information, see Namespaces Properties on page 222. To prevent ambiguities, Data Transformation Studio displays a warning if you try to add two schemas that use the same alias for different namespaces. It assigns a different alias to one of the namespaces. The Studio also warns if you try to add two schemas that use an empty alias for different namespaces.
Mixed Content
Data Transformation supports XML elements that have mixed content, that is, elements containing both character data and nested elements. You can use the mixed attribute in a schema. Data Transformation distinguishes between character data before and after each element. For more information, see Mapping Mixed Content on page 61.
Model groups in place of XML entities, for example: <city>Montr<c:eacute/>al</city>, are not supported and should not be used. Other uses of groups are supported.
Data holders having the XSD type long or unsignedLong currently support integers with absolute values up to 2147483647. Larger values are not supported and may give incorrect results.
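The limit of 2147483647 equals 2 to the power of 31, minus 1. If source data might exceed it, you can screen the values before they reach a transformation. The following Python sketch is a hypothetical pre-check, not a Data Transformation feature.

```python
# The largest supported absolute value, equal to 2**31 - 1.
MAX_SUPPORTED = 2147483647


def is_supported_long(value):
    """Return True if the integer value is within the supported range."""
    return abs(int(value)) <= MAX_SUPPORTED


within_limit = is_supported_long("2147483647")
beyond_limit = is_supported_long("2147483648")
```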
In the Data Transformation Explorer view, right-click the XSD node of a project and click Add File. Browse to the XSD file. If the XSD file is not in your project folder, Data Transformation copies the file to the project folder. The XSD folder of the Data Transformation Explorer displays the schema file. If the schema references any other XSD files, the Include folder of the Data Transformation Explorer displays their names. Optionally, if the schema defines a target namespace, you can edit the namespace alias in the Data Transformation project properties. For more information, see Namespaces Properties on page 222.
1. In the Data Transformation Explorer, right-click the XSD node and click New > XSD. The Data Transformation Explorer displays the new file with a default name such as untitled1.xsd.
2. Rename the file immediately. This is important to prevent errors in the references that a project creates to the schema. Data Transformation Studio does not allow you to change the name of an existing schema file.
3.
This opens the schema in an editor window. You can configure the Studio to open an editor of your choice. For more information about the XSD editor configuration, see Using Data Transformation Studio in Eclipse.
Viewing a Schema
In the Schema view of Data Transformation Studio, you can view the namespaces, elements, and attributes of all schemas that belong to a project. The namespace listing includes:
- Default entries for the World Wide Web Consortium schema namespaces, beginning with http://www.w3.org. In most projects, you can ignore these entries.
- A Variables namespace, containing the variables defined in your project.
- An entry for each target namespace that is defined in the schemas that you have added to the project.
If you add one or more schemas that do not define a target namespace, they are displayed under the no target namespace heading.
Right-click the schema in the Data Transformation Explorer.
Click Create Example XML.
The data type of an element or attribute. The sample displays "a" for string data, "1" for integer data, and "1.1" for floating point.
The multiplicity of an element. If an element can occur more than once, the sample displays it more than once.
Alternatively, when you configure a parser, you can drag text from the example source to a data holder in the Schema view. Data Transformation Studio creates a Content anchor that maps the selected text to the data holder.
Do not attempt to type this value. If you wish to modify the mapping, select the data_holder property and press Enter. This opens a Schema view, where you can select the new mapping.
The Data Transformation XPath syntax is slightly different from the standard XPath syntax, which is Person/Name/First. Data Transformation inserts *s, *c, and *a, which refer to the XSD terms sequence, choice, and all. The modifications resolve ambiguities when Data Transformation uses XSD to construct XML output.
Data Transformation considers this structure to contain data holders in the following locations:
- Immediately after the <Deal> tag, before any of the sub-elements.
- Before the Price element
- The Price element
- After the Price element
- Before the Partner element
- The Partner/Name and Partner/ID elements
- After the Partner element
- Immediately before the </Deal> tag, after all the sub-elements.
You can map the text " We are pleased to offer you a price of" to the data holder before the Price element. You can map "dollars. " to the data holder after Price, and "This is a special price for " to the data holder before Partner. The Schema view displays the mixed-content data holders.
data_holder = /Deal/*s/Price/$text_before
A schema can define derived XSD data types that can be used in place of a base type. In such cases, an XML document can define the actual data type of an element by specifying an xsi:type attribute. For example, an XSD schema defines a Person element having a type PersonT1 and containing string content. It defines a type called PersonT2 that extends PersonT1 by adding an Id attribute. The following are valid Person elements:
<!-- base type PersonT1 -->
<Person>Ron Lehrer</Person>
<!-- derived type PersonT2 -->
<Person Id="547329876" xsi:type="PersonT2">Ron Lehrer</Person>
Data Transformation interprets xsi:type attributes in input XML documents. It adds xsi:type attributes where necessary to output XML documents. When you map a data holder to an element that can have multiple types, the Schema view displays the types.
Select the appropriate type according to the data that the transformation processes. For example, if you want a Content anchor to store data in a Person element having type PersonT2, select xsi:type=PersonT2. The IntelliScript displays the selection as follows:
data_holder=/Person/*c/xsi:type=PersonT2
In cases where the content might require either a PersonT1 or PersonT2 data holder, you can configure an Alternatives anchor that contains two Content anchors. One of the Content anchors is mapped to PersonT1, and the other to PersonT2. For more information, see Alternatives on page 83. If you map a data holder to the unqualified element Person, the data holder defaults to the base type PersonT1. Thus the following mappings are equivalent:
data_holder=/Person
data_holder=/Person/*c/xsi:type=PersonT1
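The defaulting rule above can be illustrated outside Data Transformation. The following is a minimal Python sketch, not Data Transformation code: it reads the xsi:type attribute of each Person element and falls back to the base type when the attribute is absent. The People wrapper element and the effective_type helper are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

def effective_type(element, base_type="PersonT1"):
    """Return the xsi:type of an element, or the base type if none is given."""
    return element.get(f"{{{XSI_NS}}}type", base_type)

# Hypothetical wrapper document holding the two valid Person elements.
doc = ET.fromstring(
    f'<People xmlns:xsi="{XSI_NS}">'
    "<Person>Ron Lehrer</Person>"
    '<Person Id="547329876" xsi:type="PersonT2">Ron Lehrer</Person>'
    "</People>"
)

types = [effective_type(person) for person in doc.findall("Person")]
print(types)
```

The unqualified Person element resolves to PersonT1, which mirrors the equivalence of the two mappings shown above.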
Sequence of Elements
When Data Transformation runs a parser, it organizes the output in the sequence that is required by the XSD schema. For example, a schema may require that a LastName element precede a FirstName element. Data Transformation creates the output in the locations defined by the schema, even if the anchors that produce the output are defined in the opposite sequence.
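As a rough illustration of this reordering, the following Python sketch sorts parser results into the sequence that a schema requires. It is a simplified model under stated assumptions, not the product's implementation: SCHEMA_ORDER and order_by_schema are hypothetical names, and the schema is reduced to a flat list of element names.

```python
# Hypothetical element order required by the schema (an xs:sequence).
SCHEMA_ORDER = ["LastName", "FirstName"]

def order_by_schema(pairs, schema_order):
    """Sort (element name, value) pairs into the order the schema requires."""
    rank = {name: index for index, name in enumerate(schema_order)}
    # sorted() is stable, so repeated elements keep their relative order.
    return sorted(pairs, key=lambda pair: rank[pair[0]])

# The anchors produced FirstName before LastName.
found = [("FirstName", "Ron"), ("LastName", "Lehrer")]
ordered = order_by_schema(found, SCHEMA_ORDER)
print(ordered)
```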
Number of Occurrences
A parser may attempt to insert multiple instances of an element in the output XML. Data Transformation uses the schema to determine whether the new instances should be appended or should overwrite the existing elements. The parser deletes any excess elements beyond those that the schema permits, and it writes warnings in the event log. In another example, suppose the schema defines an element without specifying a minOccurs or maxOccurs attribute. According to the XSD standard, the default minOccurs and maxOccurs values are 1, which means that the element must occur exactly once in the parser output. If the element is missing from the output, the parser can add it. For more information, see Multiple-Occurrence Data Holders on page 67.
- Partial validation. Some deviations are allowed between the XML source document and the schema.
- Strict validation. The XML source document must conform strictly to its schema.
To define the validation level, assign the validate_source_document property of the Serializer or Mapper component. If you use the strict mode, a validation error causes the serializer or mapper to fail. The Events view displays the errors. If you use the partial mode, the transformation might proceed despite certain validation errors. For example, if there are more occurrences of an element than the schema permits, a serializer typically ignores the excess elements and processes the valid ones, and it writes a warning in the events log. Similarly, it might ignore an element containing an invalid data type.
Data Transformation uses the Xerces C XML parser, version 2.7, to perform validation. For more information about the validation characteristics, see http://xerces.apache.org/xerces-c.
Note: Data Transformation 4.4 and earlier used partial validation. For compatibility with existing transformations developed in these earlier versions, partial validation remains the default.
Variables
Variables are temporary data holders that you can use in place of XML elements or attributes. Variables are useful if you need to store a value temporarily during the operation of a transformation, and you do not need to output the value in the XML. For example, suppose you want a parser to read two Content anchors and concatenate their values. You might map each Content anchor to a user-defined variable. You can then use an action to concatenate the variables and output the result to an XML element. In addition to the user-defined variables, Data Transformation has several pre-defined system variables. The system variables are used to store information that is needed in certain operations.
User-Defined Variables
To define a variable:
1. On the Studio menu, click IntelliScript > Insert > Variable.
2. Enter a name for the variable and press Enter.
3. Select the XSD data type that the variable can store. You can select a standard XSD type such as xs:string or xs:integer, or a global type defined in an XSD schema belonging to the project.
The variable appears under the Variables namespace in the Schema view.
Alternatively, you can define a Variable component by editing the IntelliScript directly, without using the menu command. You can add the component only at the top level of the IntelliScript, not at a nested level.
System Variables
Several system variables are defined in every Data Transformation project. The following paragraphs describe the variables and the ways in which they are used.
Description A file path. A string containing form data that should be submitted.
The following variables are used in the SubmitForm and SubmitFormGet actions:
Variable | Description
VarFormAction | The URL to which a form should be posted. The variable corresponds to the action attribute of the HTML <form> element.
VarFormData | A string containing the form data that should be submitted. This is a multiple-occurrence variable. Each occurrence is a complete instance of the form data. For more information, see Multiple-Occurrence Data Holders on page 67.
Description The directory path of the Data Transformation project or service that is currently running.
Variable | Description
VarRequestedURL | The path of the source document that a parser is processing.
VarCurrentURL | The path of the current file that a parser is processing. Usually, this is the same as VarRequestedURL. If the parser is configured with certain preprocessors, VarCurrentURL might point to a temporary file rather than the original source document. VarRequestedURL always points to the source document.
VarCurrentPost | The form data that a parser submitted to retrieve the current page.
VarSystem is a read-only variable that returns system information. It is a structure containing several nested variables:
Variable | Description
VarSystem/ExecStartTime/Year | Year when the transformation began execution
VarSystem/ExecStartTime/Month | Numerical month
VarSystem/ExecStartTime/MonthName | Name of month
VarSystem/ExecStartTime/Day | Day of month
VarSystem/ExecStartTime/DayName | Day of week
VarSystem/ExecStartTime/Hour | Hour
VarSystem/ExecStartTime/Minute | Minute
Variable | Description
VarSystem/ExecStartTime/Second | Second
VarSystem/ExecStartTime/Millisecond | Millisecond
stores the service name, directory location of the user log, and the file name of the user log. and VarServiceInfo are structures containing the following nested variables:
Description Failure identifier Failure description Failure location Name of the component that failed Additional information about the failure Name of the Data Transformation service Directory path of the user log File name of the user log
In the IntelliScript, you can configure the initialization property of the Variable. The initial values that you set by this approach are used when you run the transformation in Data Transformation Studio or as a service.
In the Run dialog box of the Studio, you can click the Details button and set the initial values. The values that you set in this way are used when you test the transformation in the Studio. For testing purposes only, the values override the ones that you set in the initialization property. They have no effect when you run the transformation as a service.
An application can pass the initial values as service parameters to a service at runtime. The service parameters override the initialization property of the variables. If the IntelliScript specifies an initial value, and you also pass a value from an application, the latter value is used.
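The precedence among these three sources of initial values can be summarized in a short Python sketch. This is only a model of the rules stated above, not an API of Data Transformation; the function name resolve_initial_value and its parameters are hypothetical.

```python
def resolve_initial_value(initialization=None, run_dialog=None,
                          service_parameter=None, in_studio=False):
    """Pick the initial value of a variable according to the stated precedence.

    In the Studio, a Run dialog value overrides the initialization property.
    In a service, a service parameter overrides the initialization property.
    Otherwise the initialization property (possibly None) is used.
    """
    if in_studio:
        return run_dialog if run_dialog is not None else initialization
    return service_parameter if service_parameter is not None else initialization

# Testing in the Studio: the Run dialog value wins.
print(resolve_initial_value(initialization="EUR", run_dialog="USD", in_studio=True))
# Running as a service: the service parameter wins; the Run dialog value is ignored.
print(resolve_initial_value(initialization="EUR", run_dialog="USD",
                            service_parameter="GBP"))
# Running as a service without a parameter: the default from the IntelliScript applies.
print(resolve_initial_value(initialization="EUR"))
```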
For more information about how to pass service parameters, see the API references.
Variable
A Variable is a user-defined variable. You can use variables for temporary storage, in the same locations that you use an XML element or attribute. For example, you can map a Content anchor to a variable, and you can use a variable as the input of an action. Variables have XSD data types. They are displayed in the Schema view and in the IntelliScript. You can define a variable only at the top (global) level of the IntelliScript.
Table 7-1. Basic Properties
Property | Description
val_type | The data type that the variable can store. Assign a standard type such as xs:string or xs:integer, or a custom type that is defined in the schemas belonging to the project. A custom type can be either simple or complex. In the latter case, the variable is a structure containing nested fields.
Description Select this option to create a multiple-occurrence variable. For more information, see Multiple-Occurrence Data Holders on page 67.
initialization | An initial value for the variable, assigned when the transformation starts. Select InitialValue and enter the value. You can initialize variables that have simple data types. Initialization of complex variables is not supported. Service parameters, which an application passes to a service at runtime, override the initialization property. You can use the initialization property to set a default that the transformation uses if an application does not pass a service parameter. For more information, see Initializing Variables at Runtime on page 66.
In a single-occurrence data holder, each assignment overwrites the preceding assignment. In a multiple-occurrence data holder, each assignment generates a new occurrence of the data holder.
To understand this, suppose that an XSD schema defines an XML element called FirstName. If maxOccurs = 1, this is a single-occurrence data holder. If a parser maps more than one Content anchor to the FirstName element, the output contains only the final mapping.
Consider what would happen if you parse a source document that is a list of first names:
Jack Jennie Larissa
Assume that each name is a Content anchor mapped to FirstName. Each name overwrites the previous value of FirstName, so the output contains only the final mapping:
<FirstName>Larissa</FirstName>
Now suppose that maxOccurs = unbounded. This means that FirstName is a multiple-occurrence data holder. If you map multiple Content anchors to the element, the parser generates a list of names. The output is:
<FirstName>Jack</FirstName>
<FirstName>Jennie</FirstName>
<FirstName>Larissa</FirstName>
The same principle applies to variables. If you map multiple anchors to a multiple-occurrence variable, each anchor generates a new occurrence of the variable. You can use this feature, for example, to prepare input for the AppendListItems and CombineValues actions, which concatenate the occurrences.
Note: The behavior described here assumes that the multiple-occurrence data holder has a simple XSD type. Under certain circumstances, if the type is complex, each anchor might not generate a new occurrence. To control this behavior, you can use a locator. For more information, see Locators, Keys, and Indexing on page 191.
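The difference between the two kinds of data holder, for simple types, can be modeled in a few lines of Python. This is an illustrative sketch only; the DataHolder class is a hypothetical stand-in and not part of Data Transformation.

```python
class DataHolder:
    """Toy model of a data holder with a simple type."""

    def __init__(self, name, multiple=False):
        self.name = name
        self.multiple = multiple  # True models maxOccurs = unbounded
        self.values = []

    def assign(self, value):
        if self.multiple:
            self.values.append(value)   # each assignment adds an occurrence
        else:
            self.values = [value]       # each assignment overwrites the last

    def to_xml(self):
        return "\n".join(f"<{self.name}>{v}</{self.name}>" for v in self.values)

names = ["Jack", "Jennie", "Larissa"]

single = DataHolder("FirstName")
multi = DataHolder("FirstName", multiple=True)
for name in names:
    single.assign(name)
    multi.assign(name)

print(single.to_xml())
print(multi.to_xml())
```

The single-occurrence holder ends up with only Larissa, while the multiple-occurrence holder keeps all three names in the order they were assigned.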
Attributes
An XML attribute is always a single-occurrence data holder. An attribute cannot be multiple-occurrence because XML does not permit the same attribute to appear more than once in the same element. An attribute can have an XSD type that is a space-separated list. The names attribute in the following element is an example:
<Countries names="USA Canada Mexico"/>
Data Transformation treats the attribute as a single-occurrence data holder with an XSD list type. For more information, see Using XSD Data Types to Narrow the Search Criteria on page 80.
Indexing
By default, Data Transformation accesses the instances of a multiple-occurrence data holder sequentially. You can access the instances non-sequentially by using the indexing feature. For more information, see Locators, Keys, and Indexing on page 191.
The schema defines a custom XSD data type called MyListType. The type contains a nested, multiple-occurrence element called item.
2. Define a single-occurrence variable called MyList, which has the data type MyListType.
3. Use the variable as the target of an iterative structure. For more information, see Locators, Keys, and Indexing on page 191.
Each iteration re-uses the single occurrence of MyList. At the start of the iteration, the nested item elements are destroyed. Anchors within the iterative structure, such as a nested RepeatingGroup, start assigning the item elements from the beginning of the list.
Online Sample
For an example of how to destroy multiple occurrences of a data holder, see the following online sample:
samples\Projects\ResetListVariable\ResetListVariable.cmw
CHAPTER 8
Anchors
This chapter includes the following topics:
- Overview
- Mapping Content Anchors to Data Holders
- Defining Anchors
- Standard Anchor Properties
- How a Parser Searches for Anchors
- Anchor Quick Reference
- Anchor Component Reference
- Searcher Component Reference
- Anchor Subcomponent Reference
Overview
Anchors are the components that let a parser hook into specific locations in a source document, for the purpose of finding data and storing it in data holders. An anchor is a signpost that you place in a document, indicating the position of the data. This chapter explains the different types of anchors and how you can use them in parsers.
A Marker anchor labels a location in a document. A Content anchor retrieves text from the location.
To understand these anchors, imagine a printed questionnaire. The first line typically asks for the person's last name and first name, with each label followed by a blank space to receive the information. In Data Transformation terminology, the printed labels Last Name and First Name are Marker anchors, and the blank spaces are Content anchors. The anchors provide a means to home in on the data and extract it from the source document.
where <tab> is a tab character. You can define First name: as a Marker anchor. You can define Ron as a Content anchor. The parser learns from these definitions that it should search a source document for the string First name:. It should then skip over a single tab delimiter and retrieve the text that follows the tab. Suppose you run the parser on another source document, which contains the following text:
First name:<tab>Jack
The parser finds the anchors as above and retrieves the text Jack. Now suppose that the source document reads:
First name:<tab>Jack<tab>Age:<tab>34
The parser still retrieves the text Jack, rather than Jack<tab>Age<tab>34. This works because you have defined the tab character as a delimiter. Data Transformation understands that the Content anchor starts after the first tab and ends before the second tab. Of course, you might define additional anchors that retrieve Jack's age, which is 34.
Note: The above examples describe one possible behavior of the anchors and delimiters. The anchors have many properties that let you alter this behavior. For instance, you can define a Content anchor that ignores tabs, even in a tab-delimited format. For more information, see How a Parser Searches for Anchors on page 76.
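The default behavior described in these examples (find the Marker text, skip one tab delimiter, retrieve the text up to the next tab) can be imitated in Python as follows. This is a simplified sketch for illustration; extract_after_marker is a hypothetical helper, and the real anchors support many more options.

```python
def extract_after_marker(text, marker, delimiter="\t"):
    """Return the field that follows marker and one delimiter, or None."""
    start = text.find(marker)
    if start == -1:
        return None                      # Marker not found
    rest = text[start + len(marker):]
    if not rest.startswith(delimiter):
        return None                      # expected a delimiter after the Marker
    # The Content ends before the next delimiter, or at the end of the text.
    return rest[len(delimiter):].split(delimiter, 1)[0]

print(extract_after_marker("First name:\tJack", "First name:"))
print(extract_after_marker("First name:\tJack\tAge:\t34", "First name:"))
print(extract_after_marker("First name:\tJack\tAge:\t34", "Age:"))
```

Both of the first two calls return Jack, because the second tab bounds the Content; the third call retrieves the age with an additional Marker.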
More precisely, you might specify that the anchor should store the retrieved text at the path /Person/*s/FirstName, which refers to the XSD schema. The actual parser output would be:
<Person>
  <FirstName>Jack</FirstName>
</Person>
On the other hand, suppose that the XSD schema defines FirstName as an attribute of the Person element. You might map the Content anchor to /Person/@FirstName. The output would be:
<Person FirstName="Jack" />
You must map to a data holder that has an appropriate data type. For example, do not map Jack to an XML element that has an XSD integer data type, or to an XML element that has a complex data type containing nested elements. For more information about this rule, see Using XSD Data Types to Narrow the Search Criteria on page 80.
Note: Do not attempt to type a path such as /Person/*s/FirstName in Data Transformation Studio. When you edit a property whose value is a data holder, the Studio displays a Schema view, where you can select the data holder. The Studio displays the path in the IntelliScript.
Mapping to Variables
You can map an anchor to a data holder that is an XML element, an XML attribute, or a variable. The variable option is useful if you want to use the data in a subsequent processing step, but you do not want to include the raw data in the parser output. For example, suppose you want to extract several numbers from a source document and output their sum in the XML. You do not want the individual numbers in the output. You can map the Content anchors that retrieve the numbers to variables, and use a CalculateValue action to compute and output the sum. You might also map to a variable that you use in a subsequent anchor, for example, to define a dynamic search text for a Marker anchor.
Defining Anchors
When you define a Parser component, you must add a sequence of anchors. The parser operates by searching for the anchors in the source document and by running the operations that you have configured the anchors to perform.
If you press Enter at the indicated location, Data Transformation displays a drop-down list, which includes the anchors, as well as the other components that you can add. After you add the anchors, the Studio highlights the anchors in the example source.
Some types of anchors can contain nested anchors. For example, you can nest anchors within an Alternatives, Group, or RepeatingGroup anchor.
Sequence of Anchors
The sequence of the anchors should be the sequence of text in the source document. For example, suppose that the source document is:
First Name: Ron Last Name: Lehrer
Assuming that you define First Name and Last Name as Marker anchors, and that you define Ron and Lehrer as Content anchors, the required sequence of anchors in the parser configuration is:
Anchor
Marker
Content
Marker
Content
or
Last Name: Lehrer First Name: Ron
In such cases, you can use the marking property to change the search scope of the anchors. For more information, see How a Parser Searches for Anchors on page 76.
1. Select the anchor text in the example source file.
2. Right-click the selected text and click Insert Marker or Insert Content.
3. In the IntelliScript editor or in the IntelliScript Assistant view, set the anchor properties.
In the example pane, select the anchor text. Drag the text to a data holder in the Schema view. This creates a Content anchor that is mapped to the data holder.
3.
You can also drag and drop from the example pane to the IntelliScript pane. For example, you can drag to the text property of an anchor that is defined with the TextSearch option.
1. At the desired anchor location, select the three dots symbol (...) and press Enter.
2. Select or type the anchor name.
3. Press Enter again to confirm your selection.
4. Edit the anchor properties.
Property | Description
name | A name that you assign to the anchor. Data Transformation displays the name in the event log. This can help you find an event that was caused by the particular anchor.
remark | A comment describing the anchor.
disabled | If selected, the parser ignores the anchor. This is useful for testing and debugging, or for making minor modifications in a parser without deleting the existing anchors. Disabling an anchor disables all its nested components, such as nested anchors and transformers.
optional | By default, if an anchor fails, its parent component also fails. If you select the optional property, the parent component does not fail. You can select the optional property to define an anchor that may or may not exist in a source document. If the anchor does not exist, the Parser in which the anchor is nested continues. If the anchor is nested within a Group anchor, the optional property prevents the Group from failing. If the anchor is in a RepeatingGroup, the property prevents an iteration of the RepeatingGroup from failing. For more information, see Failure Handling on page 231.
on_fail | If the anchor fails, writes an entry in the user log. For more information, see Failure Handling on page 231.
direction | The direction in which Data Transformation searches for the anchor, within the search scope. If direction = forward, the parser finds the first instance of the anchor within the search scope. If direction = backward, the parser finds the last instance. For example, suppose the search scope for a Marker anchor contains five instances of the word Balance. If direction = forward, the parser finds the first instance of Balance. If direction = backward, it finds the last instance. For a Marker anchor, you can modify this behavior by using the count property. For example, if direction = backward and count = 2, the parser finds the second to last instance. For more information, see How a Parser Searches for Anchors on page 76.
marking | Specifies whether an anchor should be used as a reference point to find the succeeding anchor. The options are: full (places a reference point before and after the current anchor), begin position (before only), end position (after only), and none (neither). You can use this property to control the search scope for the succeeding anchor. For more information, see How a Parser Searches for Anchors on page 76.
phase | The processing phase during which Data Transformation searches for the anchor: initial, main, or final. By default, Data Transformation searches for Marker anchors during the initial phase and for Content anchors during the main phase. For more information, see How a Parser Searches for Anchors on page 76.
no_initial_phase | This property applies to components that have nested anchors. If the property is selected, the anchor has no initial phase. This overrides the option phase = initial in the immediately nested anchors, and changes it to main.
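The interaction of the direction and count properties for a Marker anchor can be sketched in Python. This is a simplified model of the behavior described above, not the product's search engine; find_instance is a hypothetical helper that returns a character offset.

```python
def find_instance(text, search, direction="forward", count=1):
    """Return the offset of the count-th instance of search, or None.

    direction="forward" counts from the start of the scope;
    direction="backward" counts from the end.
    """
    positions = []
    pos = text.find(search)
    while pos != -1:
        positions.append(pos)
        pos = text.find(search, pos + len(search))
    if count < 1 or count > len(positions):
        return None
    if direction == "forward":
        return positions[count - 1]
    return positions[-count]

scope = "Balance 1 Balance 2 Balance 3 Balance 4 Balance 5"
print(find_instance(scope, "Balance"))                                # first instance
print(find_instance(scope, "Balance", direction="backward"))          # last instance
print(find_instance(scope, "Balance", direction="backward", count=2)) # second to last
```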
Search phase
This section explains the concepts, and how you can control each of them by setting the anchor properties.
Search Phases
Data Transformation searches for a sequence of anchors in three phases:
By default, all Marker anchors are in the initial phase and all Content anchors are in the main phase. This means that Data Transformation first finds the Marker anchors, and then it finds the Content anchors between them. To understand this, consider a parser that processes the following source document:
First name: Ron Last name: Lehrer
Suppose you have defined the anchors in the following way, with default anchor properties:
Anchor | Phase
Marker | Initial
Content | Main
Marker | Initial
Content | Main
In the initial phase, Data Transformation searches for the Marker anchors:
- It searches for First name:.
- It searches for Last name: at a location that follows First name:.
In the main phase, Data Transformation searches for the Content anchors:
- It searches for the Ron anchor at a location between First name: and Last name:.
- It searches for the Lehrer anchor at a location after Last name:.
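The two default phases can be imitated with a short Python sketch: first locate every Marker in sequence, then read each Content from the region that the Markers bound. This is an illustration under simplifying assumptions (each Content is taken as the trimmed text of its whole search scope); parse_two_phase is a hypothetical helper.

```python
def parse_two_phase(text, markers):
    """Find the markers first, then the content that lies between them."""
    # Initial phase: each Marker is searched for after the previous Marker.
    spans = []
    cursor = 0
    for marker in markers:
        start = text.find(marker, cursor)
        if start == -1:
            raise ValueError(f"marker not found: {marker!r}")
        end = start + len(marker)
        spans.append((start, end))
        cursor = end

    # Main phase: each Content lies between the end of its Marker and the
    # start of the next Marker, or the end of the document for the last one.
    contents = []
    for index, (_, end) in enumerate(spans):
        scope_end = spans[index + 1][0] if index + 1 < len(spans) else len(text)
        contents.append(text[end:scope_end].strip())
    return contents

source = "First name: Ron Last name: Lehrer"
print(parse_two_phase(source, ["First name:", "Last name:"]))
```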
Nested Phases
Anchors that have nested anchors, such as Group, have nested phases. For example, if a Group anchor runs in the main phase of a parser, a Marker anchor that is nested in the Group runs in a nested initial phase. The nested initial phase is part of the parser main phase, but it is before the other anchors in the Group. Another example is a RepeatingGroup anchor, which searches for both separators and for nested anchors. In order to identify the nested anchors correctly, it searches for the separators before it searches for the nested anchors.
The search scope for the Last name: anchor starts at the end of First name:, and extends to the end of the document. The search criterion is that the anchor must contain the text Last name:. In the main phase, the parser interpolates the Content anchors between the Marker anchors. The search scope for the Ron anchor extends from the end of the First name: anchor to the beginning of the Last name: anchor. Assuming that the parser uses a space-delimited format, the search criteria are to retrieve all the text in the search scope, after the leading space character and before the second space character. The search scope for the Lehrer anchor is from the end of Last Name: to the end of the document. The search criteria are similar to those for the Ron anchor. We can add this analysis to the anchor table that we presented above. The table now describes the complete method by which the parser finds the anchors.
Anchor | Text in the Source Document | Phase | Search Scope | Search Criteria
Marker | First name: | Initial | Entire document | Text = First name:
Content | Ron | Main | End of First name: to start of Last name: | After the leading space, before the next space
Marker | Last name: | Initial | End of First name: to end of document | Text = Last name:
Content | Lehrer | Main | End of Last name: to end of document | After the leading space, before the next space
In this example, the Marker anchor is located 10 characters after the Content anchor. By default, Data Transformation searches for the Marker in the initial phase, and it searches for the Content in the main phase. This won't work here, because Data Transformation cannot find the Marker unless it has already found the Content! The solution is to change the phase property of one of the anchors. You can change the Content to the initial phase, or the Marker to the main phase. In either case, Data Transformation finds the anchors.
- By setting the phase property of the anchor or the surrounding anchors
- By setting the marking property of the surrounding anchors
Phase Property
If a Content anchor lies between two Marker anchors, then by default, the search scope for the Content is the segment between the Marker anchors. If you change all the anchors to the same phase, the search scope of the Content is no longer bounded by the second Marker. It is from the end of the first Marker to the end of the document. As an example, consider the following source document:
Tree Fig Date<tab>October 27, 2003 (pruned)
Tree Date Palm Date April 27, 2003<tab>(planted)
The example assumes that the source document has a loose structure, containing varying numbers of spaces, tabs, or other symbols interspersed in the text, so we cannot easily use the spaces and tabs as delimiters. An example like this might arise in parsing word-processor documents. We can parse this document using a RepeatingGroup anchor, which contains nested Marker and Content anchors. The Marker anchors are the strings Tree and Date. The Content anchors are everything between the Marker anchors, including the spaces and tabs. The problem in parsing this document is in the second iteration of the RepeatingGroup, which parses the second line. If we leave the Marker anchors in the initial phase, Data Transformation incorrectly considers the first instance of the word Date to be a Marker. In the main phase, it fails to find Date Palm because the search scope is between the two Marker anchors, and there is no text between them. A possible solution is to move the Marker for Date to the main phase, and to define the Content anchor, Date Palm, using an expression that searches for a tree name of one or two words. In the initial phase of the RepeatingGroup, Data Transformation finds the Marker for Tree. In the main phase, it finds Date Palm followed by the Marker for Date.
With the new phase setting, we have changed the search scope for the tree name. The scope is now from Tree to the end of the iteration, and Data Transformation finds Date Palm successfully.
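The effect of the phase change can be approximated with Python regular expressions. The sketch below is an analogy, not IntelliScript: naive_name mimics finding the Date Marker first, and parse_line mimics searching for a one-word or two-word tree name before the Date Marker. Both function names and the regular expression are assumptions made for this illustration.

```python
import re

def naive_name(line):
    """Take the text between Tree and the first occurrence of Date."""
    start = line.find("Tree") + len("Tree")
    end = line.find("Date", start)
    return line[start:end].strip()

# Tree name of one or two words, followed by the Date marker.
LINE = re.compile(r"Tree\s+(\w+(?:\s+\w+)?)\s+Date\b\s*(.*)")

def parse_line(line):
    """Return (tree name, date text), or None if the line does not match."""
    match = LINE.match(line)
    return (match.group(1), match.group(2)) if match else None

line1 = "Tree Fig Date\tOctober 27, 2003 (pruned)"
line2 = "Tree Date Palm Date April 27, 2003\t(planted)"

print(repr(naive_name(line2)))   # empty: the first Date is taken as the Marker
print(parse_line(line1))
print(parse_line(line2))
```

On the second line the naive approach yields an empty tree name, while the expression-based approach retrieves Date Palm.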
Marking Property
Consider the following source-document structure:
MARKER %%%CONTENT A ^^^CONTENT B
Suppose that the sequence of Content A and Content B varies among the source documents. In some documents, Content B precedes Content A. In that case, the search criteria are:
- Content A and Content B both follow the Marker anchor.
- Content A begins with %%%, and Content B begins with ^^^.
By default, the search scope for Content A is from the end of the Marker to the end of the document. The search scope for Content B is from the end of Content A to the end of the document. This does not work because in some source documents, Content A and Content B are reversed. The solution is to change the search scope for Content B. You can do this by setting the marking property of Content A. The marking property specifies where Data Transformation should place the reference points that determine the start and end of the search scope. The default setting is marking = full, which means that Data Transformation places reference points before and after each anchor. The search scope for Content B begins at the last reference point, which is the one following Content A. This leads to incorrect parsing, as we have seen. To prevent Data Transformation from placing reference points around Content A, set the marking property of Content A to none. As a result, the search scope for Content B starts at the end of the Marker. This allows Data Transformation to find Content B, even if it precedes Content A.
The following table describes all four possible values of the marking property. The Result column assumes that you assign the marking value to Content A in the above example.
Marking Property | Explanation | Result
full | Data Transformation places reference marks at the beginning and end of the current anchor. This is the default behavior. | Data Transformation seeks the next anchor after the end of the current anchor. Content B follows Content A.
begin position | Data Transformation places a reference mark only at the start of the current anchor. | Data Transformation seeks the next anchor after the start of the current anchor. Content B overlaps or follows Content A.
end position | Data Transformation places a reference mark only at the end of the current anchor. | Data Transformation seeks the next anchor after the end of the current anchor. Content B follows Content A.
none | Data Transformation does not place any reference marks at the current anchor. | Data Transformation seeks the next anchor after the end of the preceding anchor. Content B follows Marker, without regard to Content A.
Note: There are a few circumstances where you must use an anchor that marks a reference point. An example is the separator of a RepeatingGroup. If the separator does not mark, it does nothing. Data Transformation Studio displays a warning if you attempt to use a non-marking anchor in a location where marking is required.
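The way the marking value of Content A moves the start of the search scope for Content B can be modeled in Python. This is a simplified sketch of the table above, not the product's algorithm; the helpers find_prefixed and parse, and the sample strings, are hypothetical.

```python
def find_prefixed(text, prefix, scope_start):
    """Return (start, end) of the first token with this prefix in the scope."""
    start = text.find(prefix, scope_start)
    if start == -1:
        return None
    end = start + len(prefix)
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end

def parse(text, marking_a="full"):
    """Find Content A (%%%) and Content B (^^^) after the MARKER text."""
    marker_pos = text.find("MARKER")
    if marker_pos == -1:
        return None, None
    marker_end = marker_pos + len("MARKER")

    a = find_prefixed(text, "%%%", marker_end)
    if a is None:
        return None, None

    # The marking value of Content A decides where the scope of B begins.
    if marking_a in ("full", "end position"):
        reference = a[1]          # after the end of Content A
    elif marking_a == "begin position":
        reference = a[0]          # after the start of Content A
    else:                         # "none": Content A leaves no reference point
        reference = marker_end

    b = find_prefixed(text, "^^^", reference)
    return text[a[0]:a[1]], (text[b[0]:b[1]] if b else None)

reversed_doc = "MARKER ^^^beta %%%alpha"
print(parse(reversed_doc, "full"))   # Content B is not found
print(parse(reversed_doc, "none"))   # Content B is found before Content A
```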
Online Samples
For an online sample of the marking property, open the project samples\Projects\Marking_Mode\Marking_Mode.cmw. The sample uses the property to alter the search scope of a Content anchor.
For another example, see samples\Projects\NonMarker\NonMarker.cmw. This sample uses the marking = none option, permitting two Content anchors to overlap. The sample also illustrates the use of direction = backward to search from the end of the scope.
- According to the delimiter locations, which Data Transformation learns from the example source
- According to a positional offset, in other words, the number of characters from a reference point
- By searching for particular text
- By searching for a pattern or regular expression
- By searching for a specified data type
- By searching for an attribute value
You can combine these search criteria in almost any way. For example, you might specify that a Content anchor begins two tabs after a Marker anchor, and that it is 10 characters long. If you do this, you are using a delimiter criterion to define the beginning of the Content anchor, and an offset criterion to define the end. The components that perform these searches are called searcher components. For more information, see Searcher Component Reference on page 98.
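The combination mentioned above (a delimiter criterion for the start of the Content and an offset criterion for its end) might look like this in Python. It is only a sketch of the idea; content_after_tabs and the sample line are hypothetical.

```python
def content_after_tabs(text, marker, tabs=2, length=10):
    """Return `length` characters that start `tabs` tabs after the marker."""
    pos = text.find(marker)
    if pos == -1:
        return None
    pos += len(marker)
    # Delimiter criterion: skip the required number of tab characters.
    for _ in range(tabs):
        pos = text.find("\t", pos)
        if pos == -1:
            return None
        pos += 1
    # Offset criterion: the Content is a fixed number of characters long.
    return text[pos:pos + length]

line = "ID:\tskip\t0123456789XYZ"
print(content_after_tabs(line, "ID:"))
```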
Further suppose that you define no other search criteria for the anchor. If you map the anchor to a data holder that has a type of xs:string, the anchor retrieves the entire string. If the data holder has a type of xs:integer, Data Transformation searches for the first substring that matches the data type. Assuming that you configure the anchor with direction = forward, the anchor retrieves the integer 81. If direction = backward, the anchor retrieves 95. Now suppose the data holder has a type of xs:integer, and the schema restricts the data holder to values less than 60. Data Transformation searches for an integer that conforms to the restriction and retrieves 56.
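A search that is narrowed by data type can be approximated in Python. The sketch below is a rough analogy, not the product's type matching: find_integer is a hypothetical helper, the max_exclusive parameter stands in for a schema restriction, and the sample string is a hypothetical stand-in that contains the integers 81, 56, and 95.

```python
import re

def find_integer(text, direction="forward", max_exclusive=None):
    """Return the first (or last) integer in text that meets the restriction."""
    candidates = [int(token) for token in re.findall(r"\d+", text)]
    if max_exclusive is not None:
        candidates = [n for n in candidates if n < max_exclusive]
    if not candidates:
        return None
    return candidates[0] if direction == "forward" else candidates[-1]

sample = "Scores: 81, 56, 95"   # hypothetical source text
print(find_integer(sample))                        # forward search
print(find_integer(sample, direction="backward"))  # backward search
print(find_integer(sample, max_exclusive=60))      # values less than 60 only
```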
The expression searches for two commas, separated by any characters other than a newline. The search finds the substring
, 56,
If the XSD type of the data holder is xs:integer, the anchor retrieves 56.
List Types
A data holder can have an XSD list type, which is a space-separated list. Data Transformation filters the text retrieved by the Content anchor to match the XSD types of the list items. Suppose that the schema defines an attribute called grades, which is a list of xs:integer items. If you map the above Content anchor to grades, the anchor returns a list of the integers in the string:
81 56 95
If the grades attribute belongs to an element called Students, the XML output is:
<Students grades="81 56 95" />
If you define the Content anchor with direction = backward, the list is reversed:
<Students grades="95 56 81" />
Decimal Type
If a data holder has the xs:decimal type, Data Transformation assumes that the decimal separator is a period. If your locale setting uses a comma as the decimal separator, an xs:decimal search might fail.
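The type-based narrowing described in this section can be imitated in a few lines of Python. The sketch below is an analogy only. The sample string, the function names, and the max_exclusive parameter are assumptions made for illustration. The values 81, 56, and 95 and the restriction to values less than 60 come from the text above.

```python
import re

# Hypothetical source text containing the integers used in the examples above.
SOURCE = "Grades: 81, 56, 95"


def integer_candidates(text, direction="forward"):
    """List the integers in the text, in the order the search would meet them."""
    values = [int(token) for token in re.findall(r"\d+", text)]
    return values[::-1] if direction == "backward" else values


def type_search(text, direction="forward", max_exclusive=None):
    """Return the first integer that satisfies the optional restriction."""
    for value in integer_candidates(text, direction):
        if max_exclusive is None or value < max_exclusive:
            return value
    return None  # nothing matches, so the anchor would fail


def list_search(text, direction="forward"):
    """Return a space-separated list, as for an XSD list of xs:integer items."""
    return " ".join(str(v) for v in integer_candidates(text, direction))


print(type_search(SOURCE))
print(type_search(SOURCE, direction="backward"))
print(type_search(SOURCE, max_exclusive=60))
print(list_search(SOURCE, direction="backward"))
```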
Online Sample
For an online example of searching by an XSD type, open the project samples\Projects\Pattern\Pattern.cmw. The sample is a parser containing a single Content anchor that is mapped to an XML element. The XSD schema uses an xs:pattern to restrict the element to certain character sequences. The anchor outputs the portion of the source document that matches the pattern.
Data Transformation searches first for Marker A and Marker E. The search scope of the Group is the region between Marker A and Marker E. Then, within the search scope of the Group, Data Transformation searches for Marker B and Marker D. The region between these Marker anchors is the search scope for Content C. Within the latter search scope, Data Transformation searches for Content C. You can view these relationships in the example pane of Data Transformation Studio. The example pane highlights the nested anchors, helping you visualize the extent of the Group.
Simple Anchors
The anchors in this category are used to define simple text elements in a document.
- Content. Retrieves text from a specified location in a source document and stores the text in a data holder.
- Marker. Defines a reference point in the source text. The parser uses the reference point to search for other anchors.
Grouping Anchors
These anchors group a set of nested anchors together.
- DelimitedSections. Defines sections of a document that are delimited by a separator.
- EnclosedGroup. Defines a bounded segment of the source document.
- Group. Binds a set of anchors together for processing as a unit.
- RepeatingGroup. Parses a repetitive section of a document.
Other Anchors
- Alternatives. Specifies alternative anchors that may exist at a particular location in a source document.
- EmbeddedParser. Activates a secondary parser that runs on a segment of the source document.
- FindReplaceAnchor. Marks text for replacement. Used with the TransformByParser transformer.
- HtmlForm. Defines an HTML form. The anchor submits the form to a web server and runs a secondary parser on the server response.
Alternatives
The Alternatives anchor allows you to define a set of alternative, nested anchors. You can define a criterion for the alternative that the parser should accept. Only the accepted anchor affects the parser output. The other anchors, whether failed or successful, have no effect on the parser output.
Example
Suppose you are parsing a document in which a date can appear in either of the following patterns:
21/10/03
October 21, 2003
To process this content, you can define an Alternatives anchor that contains two Content anchors that store their output in different XML elements. Each XML element is constrained to accept one of the date patterns. The Alternatives anchor is configured with selector = ScriptOrder. When the parser runs the Alternatives anchor, it tests the first Content anchor. If the date matches the pattern of the first anchor, the first Content anchor succeeds. If the date does not match the pattern, the first Content anchor fails, and the Alternatives anchor tests the second Content anchor. In this way, the parser can process both date patterns.
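The ScriptOrder behavior in this example can be pictured with a short Python sketch. It is an analogy, not IntelliScript. The regular expressions and the names ShortDate and LongDate are hypothetical stand-ins for the two Content anchors and the patterns that constrain their XML elements.

```python
import re

# Alternatives in script order: the first one that succeeds is accepted.
ALTERNATIVES = [
    ("ShortDate", re.compile(r"\d{2}/\d{2}/\d{2}")),
    ("LongDate", re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")),
]


def script_order(text, alternatives=ALTERNATIVES):
    """Test each alternative in sequence and return the first match."""
    for name, pattern in alternatives:
        match = pattern.search(text)
        if match:
            return name, match.group()
    return None  # all alternatives failed, so the Alternatives anchor fails


print(script_order("Date: 21/10/03"))
print(script_order("Date: October 21, 2003"))
```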
How to Define
Add an Alternatives anchor by editing the IntelliScript. Nested within the Alternatives anchor, add the alternative anchors.
Table 8-1. Basic Properties
- selector. The criterion for deciding which alternative to accept. The options are:
  - ScriptOrder. Data Transformation tests the nested anchors in the sequence that they are defined in the IntelliScript. It accepts the first nested anchor that succeeds. If all the nested anchors fail, the Alternatives anchor fails.
  - DocumentOrder. Data Transformation tests all the nested anchors. It accepts either the first or last successful nested anchor, according to the locations of the anchors in the source document. If all the nested anchors fail, the Alternatives anchor fails.
  - NameSwitch. Data Transformation searches for the nested anchor whose name property is specified in a data holder. It ignores the other nested anchors. If the named nested anchor fails, the Alternatives anchor fails.

For more information about the standard anchor properties, see Standard Anchor Properties on page 76.
You can support this situation in the following way:
1. The main parser retrieves the filename of an article and stores it in a variable.
2. The main parser contains an Alternatives anchor that is configured with the DocumentOrder option.
3. The Alternatives anchor contains nested Group anchors.
4. Each Group anchor is configured with a Marker anchor and a RunParser action, as follows:
   - The first Group contains a Marker that searches for the string News. The Group is configured with a RunParser action that runs a secondary parser called NewsParser.
   - The second Group contains a Marker that searches for Business and runs BusinessParser.
   - The third Group contains a Marker that searches for Sports and runs SportsParser.
The Alternatives anchor tests all three Group anchors. It accepts the Group containing the first Marker that occurs after the filename. The Group runs the appropriate parser on the file.
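The DocumentOrder selection described here amounts to choosing the alternative whose Marker occurs earliest in the text. The Python sketch below illustrates that idea under simplifying assumptions: the markers are plain strings, the sample text is invented, and the function returns only the name of the parser that would run.

```python
# Marker text mapped to the secondary parser named in the steps above.
MARKERS = {
    "News": "NewsParser",
    "Business": "BusinessParser",
    "Sports": "SportsParser",
}


def document_order(text, markers=MARKERS, start=0):
    """Return the parser whose marker occurs first at or after `start`."""
    hits = []
    for marker, parser in markers.items():
        pos = text.find(marker, start)
        if pos != -1:
            hits.append((pos, parser))
    if not hits:
        return None  # every alternative failed
    return min(hits)[1]  # smallest position wins


# Hypothetical source: a filename followed by category keywords.
print(document_order("article7.txt | Sports | also mentions Business"))
```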
Online Sample
For an online sample of this anchor, open the project samples\Projects\Alternatives\Alternatives.cmw. The sample uses Alternatives anchors to parse different name and date formats that may exist in a source document.
Content
A Content anchor retrieves text from the source document. It stores the retrieved text in a data holder.
How to Define
You can create a Content anchor by working in either the example source or the IntelliScript. For more information, see Defining Anchors on page 73.
Table 8-3. Basic Properties
- opening_marker. A searcher component labeling the start of a region in which Data Transformation should search for the Content anchor. Defining this property is similar to defining a Group containing a Marker followed by a Content anchor. The possible property values are NewlineSearch, PatternSearch, OffsetSearch, and TextSearch. For example, a NewlineSearch means that Data Transformation should search for the anchor after a newline character. A TextSearch means to search after a specified text string. For more information, see the Searcher Component Reference on page 98.
- closing_marker. A searcher component labeling the end of a region in which Data Transformation should search for the Content anchor. Defining this property is similar to defining a Group containing a Content anchor followed by a Marker. Defining both opening_marker and closing_marker is similar to defining a Group containing a Marker, Content, Marker sequence. The property values are the same as for opening_marker.
- value. Specifies a searcher component that searches for the text retrieved by the Content anchor. The search is between opening_marker and closing_marker. If opening_marker is not defined, the search is between the surrounding reference points. For more information, see How a Parser Searches for Anchors on page 76. The options are:
  - Empty. The Content anchor retrieves the entire search scope.
  - AttributeSearch. The Content anchor retrieves the value from an expression of the type AttributeName=.... This is useful, for example, to retrieve attribute values from an XML or HTML source document.
  - LearnByExample. The parser learns what text to retrieve according to the parser format and the example source. For example, if the parser has a tab-delimited format, it counts the number of tabs from the start of the search scope to the example text. It retrieves the text between the corresponding tabs in the source document.
  - PatternSearch. The Content anchor retrieves the first text that matches a specified regular expression.
  - TypeSearch. The Content anchor retrieves the first text that matches a specified XSD data type.
  For more information about these options, see the Searcher Component Reference on page 98. In addition to the searcher components, Data Transformation uses the XSD type of the data_holder as a search criterion. For more information, see Using XSD Data Types to Narrow the Search Criteria on page 80.
- data_holder. A data holder where the anchor should store the retrieved text.
- allow_empty_values. If selected, the Content anchor can be empty. The data_holder is assigned an empty value. This can occur, for example, if the anchor is configured with value = LearnByExample and there is nothing between the delimiters. It can also occur if there is nothing between the opening_marker and the closing_marker. If allow_empty_values is not selected in these situations, the anchor fails.
- disable_XSD_type_search. If not selected, the anchor searches for data within its search scope that matches the XSD type of the data holder. If selected, the anchor searches without regard to the XSD type. If the result, following the application of any transformers, does not have the proper type, the data cannot be stored in the data holder and the anchor fails. For more information, see Using XSD Data Types to Narrow the Search Criteria on page 80.
- ignore_default_transformers. If selected, the anchor does not apply the default transformers to the content. For more information, see Transformers on page 105.
- transformers. A sequence of transformers that Data Transformation should apply to the retrieved text. For more information, see Transformers on page 105.
- name. For more information about this and the other standard properties, see Standard Anchor Properties on page 76.
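As a rough illustration of how opening_marker and closing_marker bound the search scope, the following Python sketch imitates a Content anchor with an empty value property, which retrieves the entire scope. The function name is hypothetical and both markers are treated as plain text searches. The sample text extends the First name example used elsewhere in this chapter with an invented second line.

```python
def content(text, opening_marker=None, closing_marker=None):
    """Return the text between the two markers, or None if a marker is missing."""
    start = 0
    if opening_marker is not None:
        pos = text.find(opening_marker)
        if pos == -1:
            return None
        start = pos + len(opening_marker)  # the scope starts after the marker
    end = len(text)
    if closing_marker is not None:
        end = text.find(closing_marker, start)
        if end == -1:
            return None
    return text[start:end]


source = "First name: Ron\nLast name: Example"
# Text search for the opening marker, newline search for the closing marker.
print(content(source, opening_marker="First name: ", closing_marker="\n"))
```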
Search Direction
The direction property has multiple effects in a Content anchor. If direction = backward:
- Data Transformation searches backward from the end of the search scope for the opening_marker and closing_marker. The opening_marker still precedes the closing_marker.
- The searcher component searches backward from the end of the search scope. If the searcher component is LearnByExample, it counts the delimiters backward from the end of the search scope.
Online Sample
For an online sample of Content anchors, open the project samples\Projects\Content\Content.cmw. The sample illustrates several uses of the opening_marker, closing_marker, and value properties to configure Content anchors.
DelimitedSections
The DelimitedSections anchor parses sectioned data that is delimited by a separator. Within the DelimitedSections, nest other anchors. Each nested anchor is responsible for parsing a single section.
Example
An employee resume form contains several sections, each of which is preceded by a line of hyphens:
---------------------------
Jane Palmer
Employee ID 123456
---------------------------
Professional Experience
...
---------------------------
Education
...
You can define the sectioned region as a DelimitedSections anchor, with the line of hyphens as the separator. Because the line of hyphens precedes each section, define the separator_position as before. Within the DelimitedSections anchor, nest three Group anchors. The first Group parses the Jane Palmer section, the second Group parses the Professional Experience section, and so forth.
Optional Sections
In the above example, suppose that the second section, Professional Experience, is missing from some source documents. Its separator, the line of hyphens, is always present.
---------------------------
Jane Palmer
Employee ID 123456
---------------------------
---------------------------
Education
...
In the second Group anchor, select the optional property. This means that if the Group fails, it does not cause the DelimitedSections to fail. In the DelimitedSections anchor, set using_placeholders = always. This means that the anchor looks for the separator of the optional section, even if the section itself is missing.
Now suppose that if the Professional Experience section is missing, its separator is also missing.
---------------------------
Jane Palmer
Employee ID 123456
---------------------------
Education
...
In the second Group anchor, select the optional property. In the DelimitedSections anchor, set using_placeholders = never. This means that the anchor should not look for the separator of a missing section.
How to Define
Add a DelimitedSections anchor by editing the IntelliScript. Nested within the DelimitedSections anchor, add a sequence of anchors that parse the sections.
Table 8-5. Basic Properties
- separator. An anchor that delimits the sections. The anchor is typically a Marker.
- separator_position. Position of the separator relative to the sections. The options are before, after, between, and around.
- using_placeholders. Specifies whether the DelimitedSections should look for the separator of an optional section that is missing from the source document. The options are always, never, and when necessary.
The following list illustrates the possible values of the separator_position property. The examples assume that the separator is a vertical-line character (|).
- before. There is a separator before each section, including the first section. Example: |1|2|3|4
- after. There is a separator after each section, including the last section. Example: 1|2|3|4|
- between. There is a separator between the successive sections, but not before the first section and not after the last section. Example: 1|2|3|4
- around. There are separators before and after each section, including the first and last sections. Example: |1|2|3|4|
The following list illustrates the possible values of the using_placeholders property. The examples assume that the separator_position is before and that sections 2 and 4 are missing.
- always. The separator of a missing section always exists. Example: |1||3|
- never. The separator of a missing section never exists. Example: |1|3
- when necessary. The separator of a missing internal section always exists. The separator of a missing terminal section never exists. Example: |1||3
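One way to picture the effect of using_placeholders is the Python sketch below. It is a simplified model rather than the product's algorithm. It assumes separator_position = before, represents each nested anchor by a hypothetical predicate that accepts or rejects a chunk of text, and marks sections 2 and 4 as optional, as in the examples above.

```python
def parse_sections(text, sections, using_placeholders="always", separator="|"):
    """Assign the separated chunks of `text` to the expected sections.

    sections: list of (name, matches, optional) tuples, in document order.
    """
    # separator_position = before: ignore anything ahead of the first separator.
    chunks = text.split(separator)[1:]
    result = {}
    i = 0
    for name, matches, optional in sections:
        chunk = chunks[i] if i < len(chunks) else None
        if chunk is not None and matches(chunk):
            result[name] = chunk
            i += 1
        elif optional:
            result[name] = None  # the section is missing
            # If a placeholder separator is expected, step past the empty chunk.
            if using_placeholders != "never" and chunk is not None:
                i += 1
        else:
            raise ValueError(f"required section {name!r} not found")
    return result


# Hypothetical nested anchors: section n accepts only the text n.
SECTIONS = [(n, (lambda chunk, n=n: chunk == n), n in ("2", "4")) for n in "1234"]

print(parse_sections("|1||3|", SECTIONS, "always"))
print(parse_sections("|1|3", SECTIONS, "never"))
print(parse_sections("|1||3", SECTIONS, "when necessary"))
```

All three calls return the same assignment: sections 1 and 3 receive their text, and sections 2 and 4 are reported as missing.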
For more information about the standard anchor properties, see Standard Anchor Properties on page 76.
Online Sample
For an online sample of this anchor, open the project samples\Projects\DelimitedSections\DelimitedSections.cmw. The sample illustrates a DelimitedSections anchor that parses sections separated by a | symbol. Each section is parsed by a single Content anchor.
EmbeddedParser
The EmbeddedParser anchor uses a secondary parser to parse its search scope. A parser can call itself recursively.
Example
A document is tab-delimited, except for one section that is comma-delimited. To parse the document, you can define a main parser that uses the TabDelimited format. Define another parser that uses the CommaDelimited format. Use an EmbeddedParser anchor to run the second parser within the execution of the first parser.
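The division of labor between the two parsers can be sketched in Python as one function calling another on a segment of the input. The field layout, the function names, and the sample record are assumptions made for illustration.

```python
def parse_address(segment):
    """Secondary parser: handles the comma-delimited section."""
    street, city, postal_code = [part.strip() for part in segment.split(",")]
    return {"street": street, "city": city, "postal_code": postal_code}


def parse_record(line):
    """Main parser: tab-delimited, delegating one field to the secondary parser."""
    name, address = line.split("\t")
    # This call plays the role of the EmbeddedParser anchor.
    return {"name": name, "address": parse_address(address)}


print(parse_record("Jane Palmer\t12 Main St, Springfield, 12345"))
```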
How to Define
You can define an EmbeddedParser by editing the IntelliScript.
Table 8-7. Basic Properties
- parser. The name of the secondary parser, which must be defined in the same project.
- schema_connections. Connects the output of the secondary parser to the output of the main parser. The property contains a list of Connect subcomponents that define the relation between data holders in the output of the two parsers. For more information, see Connect on page 103.
- A sequence of transformers that the parser applies to the search scope before the secondary parser processes it.
- name. For more information about this and the other standard properties, see Standard Anchor Properties on page 76.
Online Sample
For an online sample of this anchor, open the project samples\Projects\EmbeddedParser\ EmbeddedParser.cmw. The sample uses a main parser to determine the location of an address. It then runs an EmbeddedParser to parse the address.
EnclosedGroup
The EnclosedGroup anchor defines a bounded region that contains nested anchors. The boundaries are specified by opening and closing anchors. In the case of nested boundaries, such as parentheses or HTML tags, the EnclosedGroup finds the matching boundaries. An EnclosedGroup is similar to a Content anchor with an opening_marker and closing_marker. However:
- The Content anchor retrieves the entire content between the opening and closing, without further parsing.
- The EnclosedGroup allows you to further parse the content between the opening and closing.
Example
You can define an HTML table as an EnclosedGroup, with the <table> and </table> tags as the opening and closing. The nested anchors parse the content of the table.
Anchor Component Reference 89
Suppose the <table> element contains a nested <table> element. In other words, a table is nested within a table cell. The EnclosedGroup anchor matches the parent <table> tag with the parent </table> tag. It does not match the parent <table> tag with the nested </table> tag, which would be a misidentification of the table.
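The matching of nested boundaries can be illustrated with a depth counter, as in the Python sketch below. This is a common technique for balanced delimiters and is offered only as an analogy. The guide does not describe how EnclosedGroup is implemented. The function name and sample HTML are hypothetical.

```python
import re


def enclosed_scope(text, opening, closing):
    """Return the text between `opening` and its matching `closing`."""
    start = text.find(opening)
    if start == -1:
        return None
    boundary = re.compile(re.escape(opening) + "|" + re.escape(closing))
    depth = 0
    for match in boundary.finditer(text, start):
        if match.group() == opening:
            depth += 1  # entering a (possibly nested) region
        else:
            depth -= 1  # leaving a region
            if depth == 0:
                return text[start + len(opening):match.start()]
    return None  # no matching closing boundary


html = "<table><tr><td><table>inner</table></td></tr></table><p>after</p>"
print(enclosed_scope(html, "<table>", "</table>"))
```

The result spans the whole parent table, including the nested table, instead of stopping at the first closing tag.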
How to Define
You can define an EnclosedGroup anchor by editing the IntelliScript. Add the nested anchors that parse the content.
Table 8-9. Basic Properties
- opening. The opening anchor of the EnclosedGroup, typically a Marker anchor.
- closing. The closing anchor of the EnclosedGroup, typically a Marker anchor.
- source, target. These properties are useful in situations where the anchor must select specific occurrences of data holders. For more information, see Locators, Keys, and Indexing on page 191.
- name. For more information about this and the other standard properties, see Standard Anchor Properties on page 76.
FindReplaceAnchor
This anchor is intended for use within a parser that is activated by the TransformByParser transformer. The anchor marks text in the source, and it specifies a replacement for the text. When the parsing is done, the TransformByParser transformer uses the markings to modify the text.
- If FindReplaceAnchor does not contain any nested anchors, it replaces the complete text within its search scope. For example, if FindReplaceAnchor is between two Marker anchors, it marks the text between them.
- If FindReplaceAnchor contains nested anchors, it replaces the text spanned by the nested anchors. For example, if FindReplaceAnchor contains a Marker, it replaces the Marker. If it contains two Marker anchors, it replaces the segment from the first Marker to the second, including the Marker anchors themselves.
You can configure the anchor with a static replacement string or with a string that the parser retrieves dynamically from the source document. For more information, see TransformByParser on page 131.
Example
You have a text document, to which you want to add line numbers. You can add the line numbers by the following approach:
1. Create a parser, and add a RepeatingGroup to it.
2. Within the RepeatingGroup, add a FindReplaceAnchor.
3. Within the FindReplaceAnchor, add a Marker anchor, and set its search property to NewlineSearch. This causes the FindReplaceAnchor to mark every newline in the document.
4. Configure the RepeatingGroup to store its current_iteration in a variable.
5. Set the replace_with property of the FindReplaceAnchor to the variable.
6. At the global level of the IntelliScript, define a TransformByParser transformer. Set its parser property to the parser.
Set the TransformByParser as the startup component of the transformation. The transformer outputs a modified version of the original file, containing line numbers.
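The mark-and-replace idea behind this procedure can be sketched in Python with a counter that advances at each newline. The sketch is an analogy only. It keeps each newline and appends the counter after it, which is an assumption about the desired output, and the function name is hypothetical.

```python
import re


def number_lines(text):
    """Insert a running counter after every newline found in the text."""
    iteration = 0

    def replacement(match):
        nonlocal iteration
        iteration += 1  # plays the role of current_iteration
        return f"{match.group()}{iteration} "

    return re.sub(r"\n", replacement, text)


print(number_lines("alpha\nbeta\ngamma"))
```

Because only the newlines are marked, the text before the first newline receives no number in this sketch.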
How to Define
You can define a FindReplaceAnchor by editing the IntelliScript. If required, add nested anchors marking a substring to be replaced.
Table 8-11. Basic Properties
- replace_with. Type the replacement string, or browse to a data holder that contains the text.
- on_partial_match. If the FindReplaceAnchor does not find all its nested, non-optional anchors, and on_partial_match has a value of fail, the FindReplaceAnchor fails. If on_partial_match has a value of skip, the FindReplaceAnchor removes the area spanned by the successful nested anchors from its search scope and tries to find all the nested anchors again. It iterates this procedure until it finds the anchors, as long as there is a partial match.
- source, target. These properties are useful in situations where the anchor must select specific occurrences of data holders. For more information, see Locators, Keys, and Indexing on page 191.
- name. For more information about this and the other standard properties, see Standard Anchor Properties on page 76.
Group
The Group anchor binds a sequence of anchors and actions together. It allows you to apply properties to all the nested components, together. For example, a Group allows you to define operations that Data Transformation should perform on a set of anchors or to control the phase of the nested anchors.
How to Define
You can define a Group by editing the IntelliScript. Add nested anchors, and optionally actions, that parse the content of the Group.
Optional Group
You can use the optional property of a Group to prevent Data Transformation from attempting to retrieve text from a missing section of a document. For example, to parse the source
First name: Ron
you might define First name: as a Marker and Ron as Content. If some source documents do not contain the first-name data, you can put the Marker and Content in a Group and make it optional. If First name: is not found, the Group immediately fails. The parser does not search for the Content anchor. There is a difference between making the Group optional and making its nested anchors optional. If you make both the Marker and Content optional, instead of the Group, Data Transformation ignores the Marker failure and searches for the Content. This might result in retrieving irrelevant text.
Table 8-13. Advanced Properties
- absent. If selected, the Group succeeds only if one of its nested, non-optional anchors or actions fails. You can use this feature to test for the absence of nested anchors.
- on_partial_match. If the Group does not find all its nested, non-optional anchors, and on_partial_match has a value of fail, the Group fails. If on_partial_match has a value of skip, the Group removes the area spanned by the successful nested anchors from its search scope and tries to find all the nested anchors again. It iterates this procedure until it finds the nested anchors, as long as there is a partial match.
- search_order. The order in which to process the nested anchors. The options are:
  - top-down. The nested anchors are processed in the sequence that is defined in the IntelliScript.
  - bottom-up. The nested anchors are processed in reverse order. This is useful if data from a later anchor affects how you process an earlier anchor.
- source, target. These properties are useful in situations where the anchor must select specific occurrences of data holders. For more information, see Locators, Keys, and Indexing on page 191.
- name. For more information about this and the other standard properties, see Standard Anchor Properties on page 76.
Online Sample
For an online sample of this anchor, open the project samples\Projects\persistent_search\ persistent_search.cmw. The sample illustrates a Group that is configured with the on_partial_match = skip property. The Group contains two Marker anchors:
- The first Marker searches for the text A.
- The second Marker searches for a string containing any number of * characters. It has the adjacent property, which means that it must be adjacent to the first Marker.
On the first pass, the Group finds an A character at the beginning of the source document. It does not find the second Marker adjacent to the A character, however. The Group reduces its search scope by eliminating the first A character, and searches again for the two adjacent Marker anchors. It continues this procedure until it successfully finds a string A*, which contains the adjacent Marker anchors. You can observe the behavior in the event log. The log records that the Group fails on the first two trials and succeeds on the third. Try experimenting with the on_partial_match and adjacent settings. You can see the effect in the color coding of the example source. You can also try running the sample, although the result file is empty because the parser does not contain Content anchors. If you set on_partial_match = fail, you can observe in the event log that the parser fails, because the Group cannot find the adjacent anchors.
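The retry loop that on_partial_match = skip produces can be imitated in Python. The sketch below is an analogy, not the product's implementation. The sample text is invented so that the third trial succeeds, and the second Marker is modeled as one or more * characters, which is an assumption.

```python
import re


def persistent_search(text, first="A", second=re.compile(r"\*+")):
    """Find `first` immediately followed by `second`, skipping partial matches.

    Returns the matched text and the number of trials that were needed.
    """
    start = 0
    trials = 0
    while True:
        pos = text.find(first, start)
        if pos == -1:
            return None, trials  # no partial match is left, so the Group fails
        trials += 1
        adjacent = second.match(text, pos + len(first))
        if adjacent:
            return text[pos:adjacent.end()], trials
        # Partial match: remove the area spanned by the first Marker and retry.
        start = pos + len(first)


print(persistent_search("A-A-A***"))
```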
HtmlForm
Note: This component is provided for compatibility with projects created in earlier Data Transformation versions. It is being phased out of the Data Transformation system. Do not use it in new projects.
HtmlForm is an anchor that marks a <form> element in an HTML source document. It submits the form to a web server specified in the action attribute of the form. It then activates a secondary parser that parses the server response.
Within the field_filters property of HtmlForm, you can modify the fields and values of the form. The HtmlForm anchor collects the possible values of the form fields, combines them, and submits all the combinations. You can distribute the submissions over multiple computers.
HtmlForm appends the parsed output of the web-server responses to the main parser output.
How to Define
You can define an HtmlForm by editing the IntelliScript. The following are some tips that can help you configure the anchor with a minimum of effort:
1. Prepare and run the anchor without any filters.
2. In the example pane, confirm that the HtmlForm anchor highlights the correct form.
3. Examine the Results\_HtmlForm.xml file, which contains the form data that was submitted. Confirm that the anchor included all the fields.
4. Add filters or adjust the fields as required.
- The name of the secondary parser that parses the server response.
- field_filters. Adds fields and their values to the form data that the anchor submits. Specify a sequence of AddField, ModifyField, and RemoveField subcomponents that generate the desired fields. Be sure to use the same field names as in the original HTML form. For more information, see the Anchor Subcomponent Reference on page 102.
- click. Specifies the HTML element that a simulated user clicks to submit the form. The options are ImageClick and SubmitClick. For more information, see the Anchor Subcomponent Reference on page 102.
- Select the portion of the possible field-value combinations to submit. The options are SubmitAll, SegmentIndex, and SegmentSize. For more information, see the Anchor Subcomponent Reference on page 102.
- retries. The number of retries, if the anchor cannot connect to the web server on the first attempt.
- seconds_to_wait. The interval in seconds between retries.
- js_function. The name of a JavaScript function that exists in the source document. The anchor calls the function before it submits the form.
- js_params. A list of data holders containing parameters of js_function. The parameters must be in the same order as in the function declaration.
- name. For more information about this and the other standard properties, see Standard Anchor Properties on page 76.
Marker
A Marker defines a location in a source document. It is used as a reference point, from which Data Transformation searches for the succeeding anchors. By default, the phase property of a Marker is initial, which means that Data Transformation scans a document for Marker anchors before it searches for Content anchors. For more information, see How a Parser Searches for Anchors on page 76.
How to Define
You can define a Marker by the select-and-click method or by editing the IntelliScript. For more information, see Defining Anchors on page 73.
Table 8-16. Basic Properties
- search. Defines the search criteria for the Marker. The search criteria determine where the Marker is located within the search scope. For example, a NewlineSearch locates the Marker at a newline character. A TextSearch locates the Marker at a specified string. For more information, see How a Parser Searches for Anchors on page 76. The value of this property is one of the following searcher components:
  - NewlineSearch. Searches for a newline character.
  - TextSearch. Searches for a predefined text string or for a text string that is stored in a data holder.
  - PatternSearch. Searches for a string that matches a specified regular expression.
  - OffsetSearch. Skips a predefined number of characters following the preceding reference point, or a number of characters that is stored in a data holder. The Marker is the point following the skipped characters.
  - TypeSearch. Searches for a string that conforms to a specified XSD data type.
  For more information, see the Searcher Component Reference on page 98.
- adjacent. If selected, the Marker must be adjacent to the anchor at the beginning of its search scope. If direction = backward, it must be adjacent to the anchor at the end of its search scope. If not selected, Data Transformation can skip over text until it finds the Marker.
- absent. If selected, the Marker is a test that the specified text or pattern is absent from the document. If Data Transformation finds the Marker, the Marker fails.
- count. The occurrence number to find. For example, to set the Marker at the second newline following the preceding anchor, set search = NewlineSearch and count = 2.
For more information about the standard anchor properties, see Standard Anchor Properties on page 76.
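The effect of the count property can be illustrated with a short Python sketch that locates the n-th occurrence of a search string. The function name and its parameters are hypothetical. The returned position is the point just after the occurrence, from which later anchors would be searched.

```python
def marker_position(text, needle, count=1, start=0):
    """Return the offset just after the `count`-th occurrence of `needle`."""
    pos = start
    for _ in range(count):
        found = text.find(needle, pos)
        if found == -1:
            return None  # fewer occurrences than requested, so the Marker fails
        pos = found + len(needle)
    return pos


source = "a\nb\nc"
# Comparable to search = NewlineSearch with count = 2.
offset = marker_position(source, "\n", count=2)
print(offset, repr(source[offset:]))
```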
Online Sample
In the Online Samples folder, open Projects\Markers\Markers.cmw. The sample demonstrates Marker anchors that search for:
- A predefined text string
- A newline character
- An offset
- A data type
- A regular expression
If you run the parser, note that the result file is empty because the configuration does not have any Content anchors.
RepeatingGroup
The RepeatingGroup anchor parses a repetitive region of a source document. The repeating units are called iterations. The iterations are typically delimited by a separator. The RepeatingGroup contains a sequence of nested anchors and actions that parse each iteration. The RepeatingGroup anchor treats all iterations in the same way. To parse a semi-repetitive region containing sections that require differing treatment, you can use a DelimitedSections anchor, instead.
How to Define
You can define a RepeatingGroup by editing the IntelliScript. Add the nested anchors, and optionally actions, that parse each iteration of the RepeatingGroup.
If the RepeatingGroup is configured with a separator, it searches for the next separator. Then, it searches for the anchors lying between a pair of separators. If the RepeatingGroup is not configured with a separator, it searches only for the anchors.
End of a RepeatingGroup
You can signal the end of a RepeatingGroup in ways such as the following:
- The RepeatingGroup can continue until the end of the document.
- You can insert a Marker after the RepeatingGroup. By default, the Marker is in an earlier search phase than the RepeatingGroup. This causes the parser to search for the Marker first and use it to limit the search scope of the RepeatingGroup. For more information, see Adjusting the Search Phase on page 78.
- You can set the count property, limiting the search to a maximum number of iterations.
- If the RepeatingGroup does not have a separator, it ends when the parser cannot find any more iterations.
If an iteration fails, the result depends on the configuration of the RepeatingGroup:
- If the RepeatingGroup does not have a separator, the RepeatingGroup ends. Provided that there was at least one successful iteration prior to the failed iteration, the RepeatingGroup succeeds.
- If the RepeatingGroup has a separator, and the skip_failed_iterations property is not selected, the RepeatingGroup fails.
- If the RepeatingGroup has a separator, and the skip_failed_iterations property is selected, Data Transformation skips over the failed iteration and proceeds with the next iteration. Provided that at least one iteration succeeds, the RepeatingGroup succeeds.
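The failure handling for a RepeatingGroup that has a separator can be pictured with a small Python sketch. This is only an illustration of the rules described here, not Data Transformation code; the function name parse_repeating and its parameters are hypothetical, and a ValueError stands in for a failed iteration.

```python
def parse_repeating(text, separator, parse_iteration, skip_failed_iterations=True):
    """Parse each separator-delimited iteration with parse_iteration.

    Hypothetical stand-in for a RepeatingGroup that has a separator.
    parse_iteration raises ValueError when an iteration fails.
    """
    results = []
    for chunk in text.split(separator):
        try:
            results.append(parse_iteration(chunk))
        except ValueError:
            if not skip_failed_iterations:
                # Without skip_failed_iterations, one failure fails the group.
                raise
            # Otherwise skip the failed iteration and continue with the next.
    if not results:
        # The group succeeds only if at least one iteration succeeded.
        raise ValueError("no iteration succeeded")
    return results


print(parse_repeating("1|x|3", "|", int))
```

With the option enabled, the malformed middle iteration is skipped and the other two are kept. With skip_failed_iterations=False, the same input raises an error.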
Property             Description
separator            An anchor, typically a Marker, that delimits the iterations. If you leave this property empty, the RepeatingGroup does not look for a delimiter between the iterations. Instead, it assumes that an iteration is finished when it has found all the nested anchors. It then starts to parse the next iteration from the top of the nested anchor sequence. You can build a complex separator by inserting a Group in the separator property instead of a Marker.
separator_position   Position of the separator relative to the iterations of the RepeatingGroup. The options are before, after, between, and around.
The following table illustrates the possible values of the separator_position property. The examples assume that the separator is a vertical-line character (|).

separator_position   Example    Explanation
before               |1|2|3     There is a separator before each iteration, including the first iteration.
after                1|2|3|     There is a separator after each iteration, including the last iteration.
between              1|2|3      There is a separator between the successive iterations, but not before the first iteration and not after the last iteration.
around               |1|2|3|    There are separators before and after each iteration, including the first and last iterations.
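As a rough illustration of the four separator_position values, the following Python sketch splits a string the way each option expects the separators to be placed. The function split_iterations is a hypothetical helper written for this explanation; it is not part of Data Transformation.

```python
def split_iterations(text, separator="|", position="between"):
    """Return the iterations of text for a given separator position.

    position is one of "before", "after", "between", or "around",
    mirroring the separator_position values in the table above.
    """
    if position in ("before", "around"):
        if not text.startswith(separator):
            raise ValueError("expected a separator before the first iteration")
        text = text[len(separator):]
    if position in ("after", "around"):
        if not text.endswith(separator):
            raise ValueError("expected a separator after the last iteration")
        text = text[:-len(separator)]
    return text.split(separator)


for position, sample in [("before", "|1|2|3"), ("after", "1|2|3|"),
                         ("between", "1|2|3"), ("around", "|1|2|3|")]:
    print(position, split_iterations(sample, position=position))
```

Each of the four sample strings yields the same three iterations.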
Property                 Description
skip_failed_iterations   This option has an effect only if the RepeatingGroup has a separator. By default, this option is selected. This means that the RepeatingGroup skips over a failed iteration and proceeds with the next iteration. Provided that at least one iteration succeeds, the RepeatingGroup succeeds. If you deselect the option, the RepeatingGroup fails if any iteration fails.
search_order             The order in which to process the nested anchors within each iteration. The options are:
                         - top-down. The nested anchors are processed in the sequence that is defined in the IntelliScript.
                         - bottom-up. The nested anchors are processed in reverse order. This is useful if data from a later anchor affects how you process an earlier anchor.
iteration_order          The order in which to process the iterations. The options are the same as for search_order, but apply to the iterations rather than to the anchors within an iteration.
count                    The number of iterations to run. Enter a number, or browse to a data holder that contains the number. If blank, the iterations continue until the search scope is exhausted. If count = 0, the RepeatingGroup does not search for iterations. In this case, the RepeatingGroup succeeds, but it does not produce any output.
current_iteration        A data holder where the RepeatingGroup outputs the number of the current iteration.
on_partial_match         This option controls the behavior if, in a particular iteration, the RepeatingGroup finds some but not all of its nested, non-optional anchors. In such a case, if on_partial_match has a value of fail, the iteration fails. If on_partial_match has a value of skip, the RepeatingGroup removes the area spanned by the successful nested anchors from its search scope and tries to find all the nested anchors again. The removal-retry procedure is repeated until the iteration succeeds, or until there is no longer a partial match. In the latter case, the iteration fails.
source, target           These properties are useful in situations where the anchor must select specific occurrences of data holders. For more information, see Locators, Keys, and Indexing on page 191.
on_iteration_fail        If an iteration fails, writes an entry in the user log. Use the on_fail property to write an entry if the entire RepeatingGroup fails. Use on_iteration_fail to write an entry if a single iteration fails. For more information, see Failure Handling on page 231.
name                     For more information about these properties, see Standard Anchor Properties on page 76.
Online Samples
For an online example of this anchor, open the project samples\Projects\Dynamic_And_RepeatingGroup\Dynamic_And_RepeatingGroup.cmw. The sample uses a RepeatingGroup to iterate over the lines of a document. Some lines of the source document contain a parenthesized footnote reference, such as "(1)". The RepeatingGroup contains a Group, whose purpose is to parse the footnote and insert its content in the output.
The Group contains a Content anchor that retrieves the footnote reference and stores it in a variable. The Group then activates a RunParser action that activates a secondary parser. The secondary parser finds the footnote referenced by the variable, parses it, and inserts the result in the output.
To define the location of anchors. For more information, see Anchor Component Reference on page 83. To define delimiter characters or strings. For more information, see Formats on page 43. To define the find_what string of a Replace transformer. For more information, see Transformers on page 105.
AttributeSearch
This component searches a source document for a specified attribute. The attribute must occur in an expression of the type:
AttributeName = value
or
AttributeName = "value"
The component retrieves the value. The component is a possible setting of the value property, which belongs to the Content anchor. For more information, see Content on page 85.
Example
An HTML document contains the element:
<img src='MyPicture.gif'>
You can use AttributeSearch to retrieve the value of the src attribute. It returns the text MyPicture.gif.
For example, suppose that AttributeSearch is configured to search for an attribute called time. In all of the following examples, it returns the same value, 12:55:33.
time = 12:55:33
time=12:55:33
time = '12:55:33'
time='12:55:33'
time = "12:55:33"
time="12:55:33"
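The matching rules in these examples can be approximated with a regular expression. The following Python sketch is an assumption about equivalent behavior, not the actual AttributeSearch implementation; the function name attribute_search is hypothetical.

```python
import re
from typing import Optional

# Value alternatives: double-quoted, single-quoted, or unquoted text.
_VALUE = r"""(?:"([^"]*)"|'([^']*)'|([^\s>]+))"""


def attribute_search(text: str, name: str) -> Optional[str]:
    """Return the value of the first `name = value` expression in text."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\s*=\s*" + _VALUE)
    match = pattern.search(text)
    if match is None:
        return None
    # Exactly one of the three groups is set; return it without the quotes.
    return next(group for group in match.groups() if group is not None)


print(attribute_search("<img src='MyPicture.gif'>", "src"))
```

The sketch accepts optional whitespace around the equals sign and strips single or double quotes, so all six forms of the time expression return the same value.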
Online Sample
For an online sample of this component, open the project samples\Projects\Content\Content.cmw. The sample illustrates the use of an AttributeSearch to parse a text document that has a variable = value structure.
LearnByExample
This component learns how to search for text by examining the text location in the example source document. It uses the parser format to interpret the source document. For example, if the parser has a tab-delimited format, LearnByExample counts the number of tabs from the search start to the example text. It searches for text in the source document that lies at the same number of tabs from the start of the search scope. The component is a possible setting of the value property, which belongs to the Content anchor. For more information, see Content on page 85. If the Content anchor is configured with direction = backward, the component counts the delimiters from the end of the search scope.
Table 8-22. Basic Properties
Property   Description
example    The text in the example source document at the anchor location.
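The tab-counting idea can be sketched in Python as follows. This is a simplified illustration for a tab-delimited format only, with hypothetical function names; the real component uses the parser format and may behave differently in other formats.

```python
def learn_field_index(example_line, example_text, delimiter="\t"):
    """Count the delimiters that precede example_text in the example line."""
    return example_line.split(delimiter).index(example_text)


def find_by_learned_index(line, index, delimiter="\t"):
    """Return the text at the same delimiter count in another line."""
    fields = line.split(delimiter)
    return fields[index] if index < len(fields) else None


# Learn from the example source, then apply to a different source line.
index = learn_field_index("Ron\tLehrer\t42", "Lehrer")
print(find_by_learned_index("Dan\tSmith\t37", index))
```

The learned position (one tab from the start) is then reused on a source line with different content.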
NewlineSearch
This component searches for a newline, which can be a linefeed character, a carriage return character, or both. Anchors can use NewlineSearch to find newline markers. A Delimiter component can use NewlineSearch to find newline delimiters.
OffsetSearch
This component defines the number of characters between a reference point and an anchor. For example, it can define the number of characters between the end of a Marker and the start of a Content anchor.
Table 8-23. Basic Properties
Property   Description
offset     The number of characters between the reference point and the anchor. In some locations where OffsetSearch is used, such as in a Marker anchor, Data Transformation Studio displays a browse button next to the offset property. You can enter a value or browse to a data holder containing the value.
Description If the offset is beyond the search scope, this property allows a smaller offset. This is useful, for example, to permit a truncated field size at the end of a document.
PatternSearch
This component searches for a string that conforms to a regular expression. Regular expressions are a way to define a text search criterion, similar to a wildcard search, but with greatly enhanced syntax. For more information about the regular expression syntax that Data Transformation supports, see RegularExpression on page 125. Anchors can use PatternSearch to find markers or content. The Delimiter component can use PatternSearch to find delimiters. The Replace transformer can use PatternSearch to find the text to be replaced.
Example
Suppose you want to define the string %%%, containing one or more % symbols, as a delimiter. Within the Delimiter component, you can use PatternSearch with the following regular expression:
%+
In another example, suppose you want to define a comma and a semicolon as alternative delimiters, at the same level of the delimiter hierarchy. You can use the following regular expression:
[,;]
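To see how the two patterns behave as delimiters, the following sketch uses Python's re module. This is an illustration only: Python's regular expression syntax is assumed to agree with Data Transformation for these simple patterns, and the sample strings are invented.

```python
import re

# One or more % symbols act as a single delimiter.
print(re.split(r"%+", "a%%%b%c"))

# A comma and a semicolon are alternative delimiters at the same level.
print(re.split(r"[,;]", "a,b;c"))
```

Both calls split the sample text into the same three fields.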
Description A prefix in the source document, such as a backslash character \, that causes the search component to ignore an instance of the pattern.
SegmentSearch
This component searches for opening and closing markers in a text string. It returns the segment from the opening marker to the closing marker, including the markers themselves.
The component is used in the Replace transformer to find text that is to be replaced.
Table 8-27. Basic Properties
Property   Description
Opening    The search criterion for the opening marker. The options are searcher components: TextSearch, PatternSearch, NewlineSearch, or OffsetSearch.
Closing    The search criterion for the closing marker.
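The behavior of returning the segment together with its markers can be sketched in Python. The function segment_search below is hypothetical and handles only explicit marker strings (the TextSearch case), not patterns, newlines, or offsets.

```python
def segment_search(text, opening, closing):
    """Return the first segment from opening to closing, markers included."""
    start = text.find(opening)
    if start == -1:
        return None
    end = text.find(closing, start + len(opening))
    if end == -1:
        return None
    return text[start:end + len(closing)]


source = "keep <!-- drop --> keep"
segment = segment_search(source, "<!--", "-->")
# A Replace-style use: delete the segment that was found.
print(source.replace(segment, ""))
```

Because the markers are part of the returned segment, replacing the segment removes the markers as well as the text between them.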
TextSearch
This component searches for an explicit string. Anchors can use TextSearch to find markers. The Delimiter component can use TextSearch to find delimiters. The Replace transformer can use TextSearch to find text that is to be replaced.
Example
To define the string percent-percent-tab as a delimiter, create a Delimiter component and set its search property to TextSearch. In the TextSearch/text property, type:
%%
Then press Ctrl+a, and type the ASCII code of a tab character:
009
For example, suppose that you want to find repeated instances of the first word in a document. You can define a Content anchor that retrieves the first word and stores it in a variable. You can then define Marker anchors that use TextSearch to find other instances of the word that you stored in the variable.
Table 8-28. Basic Properties
Properties
text
Description The string to find. In locations where dynamic search is supported, you can browse to a data holder containing the string. To type control characters, press Ctrl+a and enter their ASCII codes. The IntelliScript displays a tab as . Other special characters appear as ASCII codes prefixed with a dot.
Description If selected, the text is required to match the text property exactly, including upper-case and lower-case letters. A prefix in the source document, such as a backslash character \, that causes the search to ignore an instance of the string. In locations where dynamic search is supported, you can browse to a data holder containing the escape sequence.
escape_sequence
Online Sample
For an online sample of this component, open the project samples\Projects\Dynamic_And_RepeatingGroup\Dynamic_And_RepeatingGroup.cmw.
In the GetRemarkParser component of this sample, a Marker anchor uses a dynamically defined TextSearch to find a footnote at the end of the source document. For more information about this sample, see RepeatingGroup on page 95.
TypeSearch
This component searches for an anchor of a specified XSD data type. The component is a possible setting of the value property, which belongs to the Content anchor. For more information, see Content on page 85.
Table 8-30. Basic Properties
Property
val_type
AddField
This is an option of the field_filters property of HtmlForm. It adds a field to be submitted with an HTML form. Optionally, you can define multiple values. The HtmlForm anchor submits all possible combinations of the values with the values of other fields.
Table 8-31. Basic Properties
Property     Description
field_name   Name of the field.
filter       The way to generate the field values. The options are:
             - UseDataHolder. Assigns a value that is contained in a data holder. To assign multiple values, use a multiple-occurrence data holder.
             - UseValues. Assigns one or more explicit values.
             The ExcludeValues option is not in use.
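The statement that HtmlForm submits all possible combinations of field values amounts to a Cartesian product. The following Python sketch illustrates the idea under that assumption; the field names and values are invented, and field_combinations is a hypothetical helper.

```python
from itertools import product


def field_combinations(fields):
    """Return every combination of values, one dict per form submission.

    fields maps each field name to the list of values assigned to it.
    """
    names = list(fields)
    value_lists = [fields[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combos = field_combinations({"city": ["A", "B"], "year": ["2007", "2008"]})
for combo in combos:
    print(combo)
```

Two fields with two values each produce four submissions. The number grows multiplicatively as fields or values are added.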
Connect
This component specifies a correspondence between two data holders. The two data holders must have the same XSD data type. Connect is used in the EmbeddedParser anchor to specify where a secondary parser should store its result in the output of the main parser. It is used in EmbeddedSerializer to specify how the input data holders of a secondary serializer are related to the input data holders of the main serializer. It is used in EmbeddedMapper for a similar purpose, on both the input and output data holders.
Example
A secondary parser outputs an XML element called ID. You want the main parser to store this result in a variable called VarID. You can connect ID to VarID. For an additional example, see EmbeddedSerializer on page 178.
Table 8-32. Basic Properties
Property               Description
data_holder            A data holder that is referenced in the main parser or serializer.
embedded_data_holder   A data holder that is referenced in the secondary parser or serializer.
ImageClick
This subcomponent submits a form by simulating a user who clicks an image in the HTML form. The subcomponent is a possible value of the click property in the HtmlForm anchor. The pixel_x and pixel_y properties are useful if the image has an area map. The properties indicate the location in the image where the user clicked.
Table 8-33. Basic Properties
Property     Description
image_name   The name attribute of the image that is specified in the HTML code.
pixel_x      The x-coordinate where the user clicked, measured in pixels from the left edge.
pixel_y      The y-coordinate where the user clicked, measured in pixels from the top edge.
ModifyField
This is an option of the field_filters property of HtmlForm. It modifies the value of a field that is defined in the HTML code of a form. Optionally, you can define multiple values. The HtmlForm anchor submits all possible combinations of the values with the values of other fields.
Table 8-34. Basic Properties
Property     Description
field_name   Name of the field.
filter       The way to generate the field values. The options are:
             - ExcludeValues. Removes previously defined values.
             - UseDataHolder. Assigns a value that is contained in a data holder. To assign multiple values, use a multiple-occurrence data holder.
             - UseValues. Assigns one or more explicit values.
RemoveField
This is an option of the field_filters property of HtmlForm. It removes a field that is defined in the HTML code of a form.
Table 8-35. Basic Properties
Property
field_name
SegmentIndex
This is an option of the part_to_submit property of HtmlForm. It is used to distribute the form submissions between several computers.
SegmentIndex divides the set of field-value combinations into a specified number of portions, and it specifies which portion to submit. On another computer, you can configure a SegmentIndex that submits a different portion.
Description The number of portions into which the combinations should be divided. The portion to submit. 1 means the first portion, and so forth.
SegmentSize
This is an option of the part_to_submit property of HtmlForm. It is used to distribute the form submissions between several computers.
SegmentSize divides the set of field-value combinations into portions of a specified size, and it specifies which portion to submit. On another computer, you can configure a SegmentSize that submits a different portion.
Description The number of combinations in each portion, by default, 2. The portion to submit. 1 means the first portion, and so forth.
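The difference between SegmentIndex and SegmentSize can be sketched in Python. The guide does not state how the combinations are ordered or how uneven portions are handled, so the contiguous slicing and rounding used here are assumptions, and the function names are hypothetical.

```python
def portion_by_size(combinations, size, index):
    """SegmentSize-style: portions of a fixed size; index is 1-based."""
    start = (index - 1) * size
    return combinations[start:start + size]


def portion_by_count(combinations, count, index):
    """SegmentIndex-style: a fixed number of portions; index is 1-based."""
    size = -(-len(combinations) // count)  # ceiling division
    return portion_by_size(combinations, size, index)


combinations = list(range(10))  # stand-ins for ten field-value combinations
print(portion_by_size(combinations, 2, 1))
print(portion_by_count(combinations, 3, 3))
```

Each computer would be configured with the same division but a different portion number, so that together they cover the whole set.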
SubmitAll
This is an option of the part_to_submit property of HtmlForm. It submits all combinations of the field values from the same computer.
SubmitClick
This component submits a form by simulating a user who clicks a submit button. The component is a possible value of the click property of the HtmlForm anchor.
Table 8-38. Basic Properties
Property
submit_name
CHAPTER 9
Transformers
This chapter includes the following topics:
Overview, 105
Defining Transformers, 105
Standard Transformer Properties, 107
Transformer Quick Reference, 109
Transformer Component Reference, 110
Transformer Subcomponent Reference, 133
Overview
Transformers are components that modify data. You can use transformers within components such as anchors, serialization anchors, and actions. The transformers modify the output of the components. For example, if you use a transformer within a Content anchor, it modifies the data that the anchor extracts from the source document. You can also use transformers as document processors or as stand-alone, runnable components. In those cases, the transformers modify the complete content of a document. You can use the transformers supplied with Data Transformation, or you can define custom transformers. This chapter explains how to use transformers and provides detailed information on the transformers available in Data Transformation.
Defining Transformers
You can define transformers in the following locations of the IntelliScript:
- In the transformers property of an anchor or a serialization anchor
- In the default_transformers property of a format or of a serializer
- In the ProcessByTransformers document processor
- In the transformers property of certain actions
- At the global level, as a standalone, runnable component that modifies a source document
The following sections explain the use of transformers in each of these locations.
To do this, you can configure the Content anchors, which retrieve the strings Ron and Lehrer, with the ChangeCase transformer.
Sequences of Transformers
You can configure an anchor with a sequence of transformers. Each transformer modifies the output of the preceding transformer. In the Ron Lehrer example, suppose you want the following output:
<Person> <FirstName>- RON -</FirstName> <LastName>- LEHRER -</LastName> </Person>
To do this, you might configure the Content anchors with the ChangeCase and AddString transformers. The transformers change the case and add the hyphens, in sequence.
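The chaining behavior, where each transformer receives the output of the preceding one, can be pictured with a short Python sketch. The functions below are hypothetical stand-ins for ChangeCase and AddString, not the product's implementation.

```python
def change_case(text):
    """Stand-in for ChangeCase configured for upper case."""
    return text.upper()


def add_string(pre, post):
    """Stand-in for AddString: returns a transformer that wraps the text."""
    return lambda text: pre + text + post


def run_transformers(text, transformers):
    """Apply the transformers in sequence, each on the previous output."""
    for transformer in transformers:
        text = transformer(text)
    return text


sequence = [change_case, add_string("- ", " -")]
print(run_transformers("Ron", sequence))
print(run_transformers("Lehrer", sequence))
```

The order matters in general: a transformer sees only what the one before it produced.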
Default Transformers
Very often, you want the same transformers to run on all the Content anchors in a parser. You can configure the format component of the parser with default transformers. This saves you the trouble of adding the same transformers to every anchor in the parser. To do this, nest the transformers in the default_transformers property of the format. For more information, see Formats on page 43. Many of the predefined format components include default transformers. For example, the HtmlFormat component has default transformers that remove HTML tags from the output and convert HTML entities to plain text. You can change the default transformers by editing the default_transformers property. If an anchor has its own transformers, they run after the default transformers. You can cancel the default transformers for particular anchors. To do this, set the ignore_default_transformers property of the anchor.
To do this, configure the parser format component with the ProcessByTransformers document processor, and nest the transformers within the component.
You can add transformers to the default_transformers property of a serializer. The transformers that you add here run in all the ContentSerializer serialization anchors before they write to the output document.
Set the transformer as the startup component. Click Run > Run. You are prompted to select the source document that the transformer should process. The Events view appears, where you can review the events. The output file is stored in the Results folder of the project. It has a name such as Transformation of filename.txt, where filename is the source file. You can open the file in any suitable application.
Property   Definition
name       A name that you assign to the transformer. Data Transformation displays the name in the event log. This can help you find an event that was caused by the particular transformer.
remark     A comment describing the transformer.
disabled   If selected, Data Transformation ignores the transformer. This is useful for testing and debugging, or for making minor modifications in a project without deleting the existing transformers.
optional   This property means that if the transformer fails, its parent component does not fail. If you deselect the optional property and the transformer fails, it causes the parent component to fail. For more information, see Failure Handling on page 231.
Description Converts a relative path or URL to an absolute path. An XML-to-XML transformer that adds empty elements if elements are missing from the XML. Adds strings before and/or after the input text. Converts the base64 MIME encoding to a binary string. Converts a binary string to the base64 MIME encoding. Reverses strings in languages that are written from right to left. Converts big-endian Unicode to little-endian. Decodes a CDATA section of an XML document. Encodes a CDATA section of an XML document. Changes the text to upper case or lower case. Generates a GUID identifier. Generates a UUID identifier. Formats a date or time. Converts the Hebrew 7-bit encoding to the Windows code page. Converts EBCDIC to ASCII text. Encodes spaces and special characters, as required in a URL. Converts text from one code page to another. Runs a custom transformer that is implemented as a DLL. Formats a number by adding a sign, decimal point, leading and trailing zeros, and a unit. Converts a floating point number from binary to an ASCII string. Converts an integer from binary to an ASCII string. Converts a number from packed decimals to an ASCII string. Converts a number from signed decimals to an ASCII string. Reverses Hebrew text from RTL to LTR. Converts from the Hebrew MS-DOS to Windows code page. Converts Hebrew text from EBCDIC to the Windows-1255 code page. Converts Hebrew text from Unicode UTF-16 to the Windows-1255 code page. Converts Hebrew text from UTF-8 to the Windows-1255 code page. Converts HTML entities to plain text. Normalizes whitespace in an HTML document. Inserts a decimal point in a number. Inserts a string into text. Runs a custom transformer that is implemented in Java. Looks up a value in a table. Changes <tag /> to <tag></tag> in XML input. Replaces the text with data retrieved from a database.
AddString Base64Decode Base64Encode BidiConvert BigEndianUniToUni CDATADecode CDATAEncode ChangeCase CreateGuid CreateUUID DateFormatICU Dos96HebToAscii EbcdicToAscii EncodeAsUrl Encoder ExternalTransformer FormatNumber
Transformer
RegularExpression RemoveMarginSpace RemoveRtfFormatting RemoveTags Replace Resize ReverseTransformer RtfProcessor RtfToASCII SubString ToFloat ToInteger ToPackDecimal ToSignedDecimal TransformationStartTime TransformByParser TransformByProcessor
Description Modifies the text by using a regular expression. Trims leading and trailing space characters. Removes all RTF formatting characters within the text. Removes HTML tags. Replaces or deletes specified text. Pads text to a specified size. Reverses a string. Normalizes RTF code. Converts RTF input to plain text. Returns a substring of the input. Converts a floating point number from an ASCII string to binary. Converts an integer from an ASCII string to binary. Converts a number from an ASCII string to packed decimals. Converts a number from an ASCII string to signed decimals. Outputs the date and/or time at which the transformation started running. Runs a parser on the input text, replacing segments of the text. Runs a document processor on the input text, converting it to a new document format. Runs a Data Transformation service on the input. Applies a sequence of transformers. Converts text in Western European languages from Unicode UTF-16 to the Windows-1252 code page. Applies an XSLT transformation to XML input text.
XSLTTransformer
AbsURL
This transformer converts a relative file path or URL to an absolute path. For example, if the input is test.html and the base URL is http://www.example.com, the output is http://www.example.com/test.html.
If the input is an absolute path, the transformer does not alter it.
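The same resolution rule can be demonstrated with the Python standard library. This is an analogy to clarify the behavior, not a description of how AbsURL is implemented.

```python
from urllib.parse import urljoin

base_url = "http://www.example.com"

# A relative input is resolved against the base URL.
print(urljoin(base_url, "test.html"))

# An absolute input is returned unchanged.
print(urljoin(base_url, "http://other.example.org/page.html"))
```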
Table 9-1. Basic Properties
Property
AbsURL
Description For more information about these properties, see Standard Transformer Properties on page 107.
AddEmptyTagsTransformer
This is an XML to XML transformer. The transformer checks if all the elements defined in the XSD schema exist in the XML input. If not, it adds empty elements to the XML.
Table 9-3. Basic Properties
Property
root_element
Description The root element of the XML. Select from a Schema view.
Description For more information about these properties, see Standard Transformer Properties on page 107.
remark disabled
AddString
This transformer adds strings before and/or after the input text.
Table 9-5. Basic Properties
Property   Description
pre        The string to add before the text.
post       The string to add after the text.
Description For more information about these properties, see Standard Transformer Properties on page 107.
Online Sample
For an online sample, open samples\Projects\Transformers_Example\Transformers_Example.cmw. The first Content anchor in the parser is configured with an AddString transformer.
Base64Decode
This transformer converts the base64 MIME encoding to a binary string.
Table 9-7. Advanced Properties
Property
tolerance
Description This property controls how the transformer processes whitespace characters or non-base64 sections of its input. The default is ignore_white_spaces. Alternatively, you can choose ignore_none or ignore_non_base64. For more information about these properties, see Standard Transformer Properties on page 107.
name
Base64Encode
This transformer converts a binary string to the base64 MIME encoding. This is useful, for example, when you want to save binary data in XML.
Table 9-8. Advanced Properties
Property
name
Description For more information about these properties, see Standard Transformer Properties on page 107.
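For readers unfamiliar with base64, the following Python sketch shows the round trip that Base64Encode and Base64Decode perform. It uses the Python standard library as an illustration; the sample bytes are arbitrary.

```python
import base64

binary = b"\x00\x01\x02"

# Binary data becomes plain ASCII text that is safe to store in XML.
encoded = base64.b64encode(binary)
print(encoded)

# Decoding restores the original bytes.
decoded = base64.b64decode(encoded)
print(decoded == binary)
```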
BidiConvert
This transformer reverses strings that are written in right-to-left (RTL) languages, such as Hebrew and Arabic. The input must be in the RTL format. The output is LTR. The transformer operates on Windows where the default language is RTL. For a similar transformer that runs on all platforms, use hebrewBidi. The two transformers use slightly different algorithms that occasionally give different results.
Table 9-9. Advanced Properties
Property
disabled
Description For more information about this property, see Standard Transformer Properties on page 107.
BigEndianUniToUni
This transformer converts big-endian Unicode to little-endian. The transformer is supported for compatibility with projects that have been upgraded from previous Data Transformation versions. It is not available for use in new projects. Instead, set the byte order in the project properties. For more information, see Encoding Properties on page 218.
CDATADecode
This transformer decodes a CDATA section of an XML document. For example, it converts
<![CDATA[100 < 200]]>
to
100 < 200
Note: If you write the result to XML, Data Transformation re-encodes it using the standard XML encoding:
100 &lt; 200
Description For more information about these properties, see Standard Transformer Properties on page 107.
optional
CDATAEncode
This transformer encodes a CDATA section of an XML document. For example, it converts
100 < 200
to
<![CDATA[100 < 200]]>
optional
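The two CDATA transformers and the XML re-encoding mentioned in the note can be sketched in Python. The functions cdata_encode and cdata_decode are hypothetical stand-ins; they do not handle text that itself contains the ]]> sequence.

```python
from xml.sax.saxutils import escape

PREFIX, SUFFIX = "<![CDATA[", "]]>"


def cdata_encode(text):
    """Wrap the text in a CDATA section."""
    return PREFIX + text + SUFFIX


def cdata_decode(text):
    """Strip a CDATA wrapper; return other text unchanged."""
    if text.startswith(PREFIX) and text.endswith(SUFFIX):
        return text[len(PREFIX):-len(SUFFIX)]
    return text


decoded = cdata_decode("<![CDATA[100 < 200]]>")
print(decoded)
# Writing the decoded text to XML requires the standard entity encoding.
print(escape(decoded))
```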
ChangeCase
The ChangeCase transformer changes a text string to all upper case, all lower case, or only the first letter capitalized. The transformer works on English characters. It may fail on some non-English characters. For example, it does not convert the lower-case German ß to the upper-case SS.
Table 9-12. Basic Properties
Property
case_type
Description For more information about these properties, see Standard Transformer Properties on page 107.
remark disabled
Online Sample
For an online sample, open samples\Projects\Transformers_Example\Transformers_Example.cmw. The third Content anchor in the parser is configured with a ChangeCase transformer.
CreateGuid
This transformer generates a GUID identifier. The GUID is guaranteed to be unique on every generation. The transformer ignores its input. The GUID is not related to the input in any way. The GUIDs may have a non-standard format on UNIX platforms. For a fully UNIX-compatible transformer, use CreateUUID instead.
CreateUUID
This transformer generates a UUID identifier. The UUID is guaranteed to be unique on every generation and is compatible with both Windows and UNIX platforms. The transformer ignores its input. The UUID is not related to the input in any way.
Table 9-14. Advanced Properties
Property
name
Description For more information about these properties, see Standard Transformer Properties on page 107.
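As an illustration of an identifier that is generated independently of any input, the following Python sketch uses the standard uuid module. The guide does not say which UUID version the transformer produces; the random version 4 used here is an assumption, and create_uuid is a hypothetical name.

```python
import uuid


def create_uuid(_input_text=""):
    """Return a new UUID string; the input is ignored."""
    return str(uuid.uuid4())


first = create_uuid("any input")
second = create_uuid("any input")
print(first)
print(first != second)
```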
DateFormatICU
This transformer formats a date or time.
Example
Suppose you configure a DateFormatICU transformer with:
input_format = "d/M/yy"
output_format = "MM/dd/yyyy"
If the input is
13/3/05
the output is
03/13/2005
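The same conversion can be reproduced in Python to check the expected result. Python uses strptime and strftime directives rather than ICU pattern symbols, so the format strings below are hand-translated equivalents of d/M/yy and MM/dd/yyyy, and reformat_date is a hypothetical helper.

```python
from datetime import datetime


def reformat_date(text, input_format="%d/%m/%y", output_format="%m/%d/%Y"):
    """Parse text with input_format and write it with output_format."""
    return datetime.strptime(text, input_format).strftime(output_format)


print(reformat_date("13/3/05"))
```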
Supported Formats
The transformer uses the ICU conventions to represent the date and time format. The following table lists the symbols that you can use in the format patterns. For more information, see: http://icu.sourceforge.net/apiref/icu4c/classSimpleDateFormat.html
Pattern Symbol   Meaning                                                           Type             Examples
G                Era designator                                                    Text             AD
y                Year                                                              Number           1996
u                Extended year                                                     Number           -200, meaning 201 BC
M                Month in year                                                     Text or number   July 07
d                Day in month                                                      Number           10
h                Hour in AM/PM (1-12)                                              Number           12
H                Hour in day (0-23)                                                Number           0
m                Minute in hour                                                    Number           30
s                Second in minute                                                  Number           55
S                Fractional second                                                 Number           978
E                Day of week                                                       Text             Tuesday
e                Day of week (local 1-7)                                           Number           2
D                Day in year                                                       Number           189
F                Day of week in month                                              Number           2, meaning the 2nd Wednesday in July
w                Week in year                                                      Number           27
W                Week in month                                                     Number           2
a                AM/PM marker                                                      Text             PM
k                Hour in day (1-24)                                                Number           24
K                Hour in AM/PM (0-11)                                              Number           0
z                Time zone                                                         Time             Pacific Standard Time
Z                Time zone (RFC 822)                                               Number           -0800
v                Time zone (generic)                                               Text             Pacific Time
g                Julian day                                                        Number           2451334
A                Milliseconds in day                                               Number           69540000
' '              The text within single quotes is interpreted as a literal string  Text             'Today is 'dd/MM/yyyy
''               Literal single quote                                              Text             'o''clock'
- For text: Four or more pattern symbols means to use the full form. Fewer than four means to use a short or abbreviated form if it exists. For example, if EEEE produces Monday, EEE produces Mon.
- For numbers: The number of pattern symbols is the minimum number of digits. Shorter numbers are zero-padded. For example, if m produces 6, mm produces 06.
- For years: The two-digit year is yy, and the four-digit year is yyyy. For example, if yy produces 05, yyyy produces 2005.
- For months: If M produces 1, then MM produces 01, MMM produces Jan, and MMMM produces January.
All non-alphabetic characters are interpreted as literals, even if they are not enclosed in single quotes. For example, dd/MM/yyyy HH:mm produces 15/03/2005 13:15, containing the /, space, and : characters.
Table 9-15. Basic Properties
Property        Description
input_format    The format of the input date, for example, d/M/yy. You can type the format, or browse to a data holder that contains the format.
output_format   The format of the output date, for example, MM/dd/yyyy. You can type the format, or browse to a data holder that contains the format.
Description For more information about these properties, see Standard Transformer Properties on page 107.
optional
Note: If you open a project that was created in a previous Data Transformation version, you might observe that it uses the older DateFormat processor. This component is supplied for backwards compatibility. In new projects, use DateFormatICU.
Dos96HebToAscii
This transformer converts the Hebrew 7-bit encoding to the Windows-1255 code page.
EbcdicToAscii
This transformer converts EBCDIC to ASCII text.
EncodeAsUrl
This transformer encodes spaces and special characters as required in a URL. The characters are encoded as hexadecimal preceded by a % symbol.
Note: The parentheses characters are not encoded. They are displayed as ( and ).
to
http://www.example.com?name=Ron%20Lehrer
Property          Description
remark, disabled  For more information about these properties, see Standard Transformer Properties on page 107.
Online Sample
For an online sample, open samples\Projects\Transformers_Example\Transformers_Example.cmw. The fourth Content anchor in the parser is configured with an EncodeAsUrl transformer.
Chapter 9: Transformers
Encoder
This transformer converts text from one code page to another.
Table 9-18. Basic Properties
Property
input_code_page output_code_page
Description Adds a Byte Order Mark (BOM) when the output encoding is UTF-8 or UTF-16. For more information about these properties, see Standard Transformer Properties on page 107.
ExternalTransformer
This component allows you to run a custom transformer that is implemented as a C++ DLL.
Note: This component is supported for backwards compatibility with existing custom transformers. For more information about custom transformers and other external components, see the Data Transformation Engine Developer Guide.
The following instructions are for the Microsoft Visual C++ compiler, running on a Microsoft Windows platform.
To create a custom C++ transformer:
1. Copy the following file from the Data Transformation installation folder:
   samples\SDK\ExternalTransformer\ExternalTransformerExample.c
2. Using the Visual C++ compiler, create a Win32 dynamic-link library project, and insert the C file into the project.
3. Edit the following function, which performs the transformation:
   __declspec(dllexport) int transform(const char* in, char** out)
   In the sample implementation, the function reverses the text.
4. Replace the sample code with your implementation.
5. Compile the DLL.
6. Store the DLL in the externLibs\user subfolder of the Data Transformation installation folder.
7. Define an ExternalTransformer that references the DLL.
8. Optionally, add the ExternalTransformer to the component list that Data Transformation Studio displays. For more information about customizing the component list, see Using Data Transformation Studio in Eclipse.
Description For more information about these properties, see Standard Transformer Properties on page 107.
FormatNumber
This transformer formats a number by adding a sign, decimal point, leading or trailing zeros, and unit.
Table 9-22. Basic Properties
Property              Description
sign                  Adds a plus or minus sign at the beginning or end of the number. The options are:
                      - un_signed. Deletes a sign if present.
                      - leading_sign.
                      - trailing_sign.
                      - negative sign only.
                      - as in source. Does not change the input sign.
insert_decimal_point  Sets the decimal point symbol. The options are none, point, and comma.
unit_type             Adds a unit after the number. Select a unit such as meter, cm, mm, or inch. If you do not want to add a unit, select undefined.
size_of_integer_part  Pads the integer part with leading zeros to the indicated size.
number_of_decimals    Pads the decimal part with trailing zeros to the indicated size.
Description For more information about these properties, see Standard Transformer Properties on page 107.
FromFloat
This transformer converts a floating point number from binary to an ASCII string representation. The conversion is performed in the input encoding with the input byte-order.
Table 9-24. Advanced Properties
Property  Description
size      Size of the number: single_precision_32_bit or double_precision_64_bit.
name      For more information about these properties, see Standard Transformer Properties on page 107.
FromInteger
This transformer converts an integer from binary to an ASCII string representation, in decimal, octal, or hexadecimal. The conversion is performed in the input encoding with the input byte-order.
Table 9-25. Basic Properties
Property
size
Description Size in bytes of the binary representation. The supported values are 1 to 8.
Description If selected, the transformer adds a sign to the number. The base of the output: decimal, octal, hexadecimal, lowercase hexadecimal. For more information about these properties, see Standard Transformer Properties on page 107.
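To illustrate the kind of conversion that FromInteger describes, the following is a minimal sketch in plain Java. It is not the product's implementation. It assumes an unsigned interpretation of 1 to 7 input bytes (8 bytes would wrap into the sign bit of a Java long), and the class and method names are hypothetical.

```java
import java.nio.ByteOrder;

final class FromIntegerSketch {
    // Interprets raw as an unsigned binary integer in the given byte order
    // and returns its text representation in the given radix (8, 10, or 16).
    static String toText(byte[] raw, ByteOrder order, int radix) {
        long value = 0;
        for (int i = 0; i < raw.length; i++) {
            // Read the most significant byte first, whatever the storage order.
            int idx = (order == ByteOrder.BIG_ENDIAN) ? i : raw.length - 1 - i;
            value = (value << 8) | (raw[idx] & 0xFF);
        }
        return Long.toString(value, radix);
    }

    public static void main(String[] args) {
        byte[] raw = {0, 0, 1, 0};
        System.out.println(toText(raw, ByteOrder.BIG_ENDIAN, 10));
        System.out.println(toText(raw, ByteOrder.BIG_ENDIAN, 16));
        System.out.println(toText(raw, ByteOrder.LITTLE_ENDIAN, 10));
    }
}
```

The same four bytes yield different numbers depending on the byte order, which is why the transformer applies the input byte order.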
FromPackDecimal
This transformer converts a number from packed decimals to an ASCII string representation. The conversion is performed in the input encoding with the input byte-order.
Table 9-27. Advanced Properties
Property  Description
name      For more information about these properties, see Standard Transformer Properties on page 107.
FromSignedDecimal
This transformer converts a number from signed decimals to an ASCII string representation. The conversion is performed in the input encoding with the input byte-order.
Table 9-28. Advanced Properties
Property            Description
insert_sign_symbol  Adds a plus or minus sign before or after the number. The options are no, before, and after.
name                For more information about these properties, see Standard Transformer Properties on page 107.
hebrewBidi
This transformer reverses strings that are written in right-to-left (RTL) languages, such as Hebrew and Arabic. The input must be in the RTL format. The output is LTR.
HebrewDosToWindowsTransformer
This transformer converts Hebrew documents from the MS DOS Hebrew code page to the Windows Hebrew code page.
HebrewEBCDICOldCodeToWindows
This transformer converts Hebrew text from EBCDIC to the Windows-1255 code page.
hebUniToAscii
This transformer converts Hebrew text from Unicode UTF-16 to the Windows-1255 code page.
hebUtf8ToAscii
This transformer converts Hebrew text from Unicode UTF-8 to the Windows-1255 code page.
HtmlEntitiesToASCII
This transformer converts HTML entities to plain text. For example, it converts &copy; or &#169; to a copyright symbol (©).
Supported Entities
The transformer supports the ISO 8859-1 (Latin-1) entities that are defined in the HTML 4.0 reference, http://www.w3.org/TR/1998/REC-html40-19980424/sgml/entities.html. The supported entities include:
- &amp;, &lt;, &gt;, &quot;
- Numeric character codes &#0; to &#255;
- Entities for Latin-1 characters: &nbsp; = non-breaking space, &copy; = copyright, etc.
The transformer does not support extended characters, that is, codes greater than 255 or non-Latin-1 characters.
Property          Description
remark, disabled  For more information about these properties, see Standard Transformer Properties on page 107.
HtmlProcessor
This transformer normalizes whitespace according to HTML conventions. It converts any sequence of tabs, line breaks, and space characters to a single space character. You can use this transformer to normalize whitespace in any type of text. It is not restricted to HTML text. The component can also be used as a format preprocessor. For more information, see Format Preprocessor Component Reference on page 52.
InjectFP
This transformer inserts a decimal point at a specified location in a number. For example, the transformer can convert 12345 to 123.45.
Table 9-30. Basic Properties
Property
digits_after_decimal_point
Description For more information about these properties, see Standard Transformer Properties on page 107.
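The effect of InjectFP on the example above (12345 becomes 123.45) can be sketched in plain Java. This is an illustration only, assuming the input is a string of digits and that digits_after_decimal_point is 2. The class and method names are hypothetical and are not part of Data Transformation.

```java
import java.math.BigDecimal;

final class InjectFpSketch {
    // Inserts a decimal point so that digitsAfterPoint digits follow it.
    static String injectDecimalPoint(String digits, int digitsAfterPoint) {
        return new BigDecimal(digits).movePointLeft(digitsAfterPoint).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(injectDecimalPoint("12345", 2));
    }
}
```

BigDecimal is used instead of floating-point division so that no rounding error is introduced. How the actual transformer treats inputs shorter than the requested number of decimals is not stated in this guide.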
InjectString
The InjectString transformer inserts a string into text.
Table 9-32. Basic Properties
Property          Description
injection_place   The location in the text to insert the string. 0 means to insert the string before the text.
string_to_inject  The string to insert.

Property          Description
remark, disabled  For more information about these properties, see Standard Transformer Properties on page 107.
JavaTransformer
This component allows you to run a custom transformer that is implemented in Java.
Note: This component is supported for backwards compatibility with existing custom transformers. For more information about custom transformers and other external components, see the Data Transformation Engine Developer Guide.
To create a custom Java transformer:
1. Create a new Java project and package, for example, named MyTransformer.
2. Create a class, for example, named TransformerTest.
3. In the class, define a method having the following syntax. The method can have any name.
   public static byte[] Transform(byte[] in)
4. Create a jar file containing the class.
5. Store the jar file in the externLibs\user subfolder of the Data Transformation installation folder.
6. Define a JavaTransformer that references the class and method.
7. Optionally, add the JavaTransformer to the component list that Data Transformation Studio displays. For more information about customizing the component list, see Using Data Transformation Studio in Eclipse.
Example
The following example is a transformer that changes text to upper case.
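package MyTransformer;

public class TransformerTest {
    // Receives the input text as bytes and returns the transformed bytes.
    public static byte[] Transform(byte[] in) {
        // Decode the input using the platform default character set.
        String str = new String(in);
        String ret = str.toUpperCase();
        return ret.getBytes();
    }
}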
Description The path of the Java class, for example, MyTransformer/TransformerTest. The method to run, for example, Transform.
Description For more information about these properties, see Standard Transformer Properties on page 107.
optional
LookupTransformer
This transformer looks up a value in a table. For example, you can configure a LookupTransformer to look up values in the following table:
Key  Value
1    George Washington
2    John Adams
3    Thomas Jefferson
4    James Madison
If the input of the transformer is 3, the output is Thomas Jefferson.
There are three ways to define a lookup table:
- Inline lookup table. The table is fully defined in the IntelliScript.
- XML lookup table stored in a file. The table is defined in an XML file that the LookupTransformer retrieves.
- XML lookup table generated dynamically. The transformation prepares an XML lookup table at runtime and stores it in a data holder.
For more information about configuring a secondary parser that generates a dynamic lookup table, see AdditionalInputPort on page 16.
Table 9-36. Basic Properties
Property  Description
look_at   Under this property, you define the type of lookup table used by the transformer. Select one of the following values:
          - InlineTable
          - XMLLookupTable
          - DynamicTable
          If you use the same lookup table repeatedly, consider defining an InlineTable or an XMLLookupTable at the global level of the IntelliScript. You can then reference the table by name in the look_at property.
Description For more information about these properties, see Standard Transformer Properties on page 107.
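Conceptually, the lookup described in this section is a key-to-value mapping. The following minimal Java sketch reproduces the presidents table from the example. It only illustrates the behavior and is not how the LookupTransformer is configured or implemented. The class name and the choice to return null for a missing key are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LookupSketch {
    // The example lookup table, keyed by the transformer input.
    static final Map<String, String> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put("1", "George Washington");
        TABLE.put("2", "John Adams");
        TABLE.put("3", "Thomas Jefferson");
        TABLE.put("4", "James Madison");
    }

    // Returns the value for the key, or null if the key is not in the table.
    static String lookup(String key) {
        return TABLE.get(key);
    }

    public static void main(String[] args) {
        System.out.println(lookup("3"));
    }
}
```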
NormalizeClosingTags
For XML input, this transformer removes shorthand closing tags from empty elements. It changes <tag/> to <tag></tag>. The transformer does not correct incorrect XML. It converts well-formed XML from one style of closing tag to another.
Table 9-38. Advanced Properties
Property                Description
name, remark, disabled  For more information about these properties, see Standard Transformer Properties on page 107.
ODBCLookup
The ODBCLookup transformer uses the input text to query a database. It replaces the text with the query result.
Table 9-39. Basic Properties
Property       Description
db_connection  The database connection. The value is an ODBC_Text_Connection subcomponent, which specifies a DSN, user name, and any other required connection parameters.
Description A SQL SELECT or EXEC query that retrieves the data from the database. Use ? to represent the input text, for example:
SELECT Name FROM Employees WHERE Id=?
The query must retrieve a single field, which is the transformer output.
Property  Description
retry     The number of retries if the first connection attempt fails.
disabled  For more information about these properties, see Standard Transformer Properties on page 107.
RegularExpression
The RegularExpression transformer performs a pattern search on the input text. It replaces instances of the pattern with a specified string. For example, suppose that a Content anchor retrieves the text:
transformer
You configure the anchor with a RegularExpression transformer that searches for the pattern t.+s. The pattern means the letter t, followed by one or more characters, followed by the letter s. You configure the transformer to replace the pattern with the character X. The pattern matches the substring trans of the input. The transformer replaces the substring and outputs:
Xformer
Description A regular expression for the search criterion. The replacement text. Leave blank to delete the found text.
Description For more information about these properties, see Standard Transformer Properties on page 107.
Regular Expressions
A regular expression is a string that uses a standard syntax to define a search pattern. The syntax is similar to a wildcard search, but with greatly enhanced search capabilities. Consider, for example, the following text:
Peter Piper picked a peck of pickled peppers
The following are illustrations of regular expressions and the strings that they retrieve from the above text.
Regular expression  Retrieved string  Explanation
Peter               Peter             Retrieves the literal string.
Pip.+cked           Piper picked      Retrieves a string beginning with Pip, followed by at least one character, followed by cked.
[Pp]i[a-k]{3}       picke             Retrieves a string beginning with P or p, followed by i, followed by exactly 3 characters between a and k. Finds picke because the string matches these criteria. Does not find Piper or pickl because the letters p and l do not lie between a and k.
The following table lists special characters that you can use in regular expressions. The table is not comprehensive. There are other syntax combinations that have special meanings in regular expressions.
Character  Meaning                                                              Example
*          Match zero or more instances of the preceding character              ab*c matches ac, abc, or abbbc.
?          Match zero or one instance of the preceding character
+          Match one or more instances of the preceding character               a+ matches a or aaaa.
{}         Match the specified number of instances of the preceding character
[]         Match any of a set of characters                                     a[bst]c matches abc, asc, or atc.
-          Use inside [] to represent a range of characters                     [A-Za-z] matches any character in the English alphabet.
                                                                                [A-Za-z] matches any character in the English alphabet or the German .
.          Match any character                                                  a.c matches abc, a c, or a1c.
^          Match the start of the input text                                    ^P. matches Pe but not Pi in the Peter Piper example.
$          Match the end of the input text                                      r.$ matches rs in the Peter Piper example.
|          Match either of two expressions                                      abc|def matches abc or def.
()         Grouping                                                             A(abc|def) matches Aabc or Adef.
\          Escapes one of the other special characters, treating it as a literal character  \. matches a literal period, rather than any character.
Data Transformation uses the Regex++ implementation of regular expressions, copyright 1998-2003 by Dr. John Maddock, Version 1.33, 18 April 2000. For detailed information about the regular expression syntax supported by this implementation, see http://www.boost.org/libs/regex/doc/index.html. For general information about regular expressions, see http://en.wikipedia.org/wiki/Regular_expression and http://www.regular-expressions.info/index.html.
to identify the entire text that matches the regular expression
to identify the substring that matches the first parenthesized portion of the regular expression
$2, $3, and so forth, to identify the substrings that match the second, third, etc. parenthesized portions
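The examples in this section can be checked with any regular expression engine that supports the same basic syntax. The following minimal sketch uses the standard java.util.regex package, which is an assumption made for illustration only. Data Transformation itself uses the Regex++ implementation, and the two engines differ in some advanced syntax. The class name RegexSketch and the Ron Lehrer input are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexSketch {
    // Returns the first substring of text that matches regex, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Peter Piper picked a peck of pickled peppers";
        System.out.println(firstMatch("Pip.+cked", text));
        System.out.println(firstMatch("[Pp]i[a-k]{3}", text));
        // Replace the matched substring, as in the transformer example.
        System.out.println("transformer".replaceAll("t.+s", "X"));
        // $2 and $1 refer to the second and first parenthesized portions.
        System.out.println("Ron Lehrer".replaceAll("(\\w+) (\\w+)", "$2, $1"));
    }
}
```

In Java source code, each backslash in a pattern is doubled because the backslash is also the string escape character.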
RemoveMarginSpace
This transformer deletes leading and trailing space characters from the text.
Table 9-43. Advanced Properties
Property                Description
name, remark, disabled  For more information about these properties, see Standard Transformer Properties on page 107.
RemoveRtfFormatting
The transformer removes RTF formatting instructions from the text.
Table 9-44. Advanced Properties
Property                Description
name, remark, disabled  For more information about these properties, see Standard Transformer Properties on page 107.
RemoveTags
This transformer removes HTML tags from the input text. It replaces the tags at internal locations in the text with a separator string, such as a space character. It does not insert the separator string at the beginning or end of the text. Adjacent multiple tags are transformed into a single separator.
Table 9-45. Basic Properties
Property
replace_with
Description For more information about these properties, see Standard Transformer Properties on page 107.
remark
Replace
This transformer finds and replaces strings in the input text. Leaving the replace_with property empty deletes the found text.
Table 9-47. Basic Properties
Property      Description
find_what     The text to find. The value is one of the following searcher components:
              - NewlineSearch. Finds a newline character.
              - PatternSearch. Finds text that matches a regular expression.
              - SegmentSearch. Finds a segment from a specified opening marker to a closing marker.
              - TextSearch. Finds a specified string.
              For more information, see the Searcher Component Reference on page 98.
replace_with  The replacement string.
Description Specifies which occurrences to replace: all, first, or last. For more information about these properties, see Standard Transformer Properties on page 107.
Online Sample
For an online sample, open samples\Projects\Transformers_Example\Transformers_Example.cmw. The second and fifth Content anchors in the parser are configured with Replace transformers.
Resize
This transformer fits the input text to a specified size. It pads or truncates the text as required.
Table 9-49. Basic Properties
Property           Description
size               The desired size. Type an integer, or click the browse button and select a data holder that contains an integer.
padding_character  The padding character, such as a space character. Type the character, or click the browse button and select a data holder that contains a character.
align              The text alignment within the resized string. The options are:
                   - left. Padding or trimming is on the right.
                   - right. Padding or trimming is on the left.
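The interaction of size, padding_character, and align can be sketched as follows. This is a hypothetical plain-Java illustration of the behavior described in the table, not the Resize implementation. The class name, method name, and the boolean alignLeft parameter are assumptions.

```java
final class ResizeSketch {
    // Pads or truncates text to exactly size characters.
    // alignLeft = true: padding or trimming is on the right.
    // alignLeft = false: padding or trimming is on the left.
    static String resize(String text, int size, char pad, boolean alignLeft) {
        int len = text.length();
        if (len >= size) {
            return alignLeft ? text.substring(0, size) : text.substring(len - size);
        }
        StringBuilder padding = new StringBuilder();
        for (int i = len; i < size; i++) {
            padding.append(pad);
        }
        return alignLeft ? text + padding : padding + text;
    }

    public static void main(String[] args) {
        System.out.println(resize("abc", 5, '*', true));
        System.out.println(resize("abc", 5, '*', false));
        System.out.println(resize("abcdef", 4, ' ', true));
        System.out.println(resize("abcdef", 4, ' ', false));
    }
}
```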
ReverseTransformer
This transformer reverses a string. For example, it transforms 1234 to 4321.
RtfProcessor
This transformer normalizes RTF code. It is also available as a format preprocessor. For more information, see Format Preprocessor Component Reference on page 52.
RtfToASCII
This transformer converts RTF input to plain text. It removes RTF control words from the text.
Table 9-50. Advanced Properties
Property                Description
name, remark, disabled  For more information about these properties, see Standard Transformer Properties on page 107.
SubString
This transformer returns a substring of the input, starting and ending at specified locations.
Table 9-51. Basic Properties
Property  Description
begin     The start location. 0 means to start at the beginning of the input.
end       The end location.

Property          Description
remark, disabled  For more information about these properties, see Standard Transformer Properties on page 107.
ToFloat
This transformer converts a floating point number from an ASCII string representation to binary. The conversion is performed in the output encoding with the output byte order.
Table 9-53. Advanced Properties
Property  Description
size      Size of the number: single_precision_32_bit or double_precision_64_bit.
name      For more information about these properties, see Standard Transformer Properties on page 107.
ToInteger
This transformer converts an integer from an ASCII string representation to a binary integer. The string input can be a decimal, octal, or hexadecimal representation. The conversion is performed in the output encoding with the output byte order.
Table 9-54. Basic Properties
Property
size
Description Size in bytes of the binary representation. The supported values are 1 to 8.
Description If selected, the input has a plus or minus sign. The base of the input: decimal, octal, hexadecimal, lowercase hexadecimal. For more information about these properties, see Standard Transformer Properties on page 107.
ToPackDecimal
This transformer converts a number from an ASCII string representation to packed decimals. The conversion is performed in the output encoding with the output byte order.
Table 9-56. Advanced Properties
Property  Description
name      For more information about these properties, see Standard Transformer Properties on page 107.
ToSignedDecimal
This transformer converts a number from an ASCII string representation to signed decimals. The conversion is performed in the output encoding with the output byte order.
Table 9-57. Advanced Properties
Property            Description
insert_sign_symbol  Adds a sign symbol, plus or minus, before or after the number. The options are no, before, and after.
name                For more information about these properties, see Standard Transformer Properties on page 107.
TransformationStartTime
This transformer outputs the date and/or time at which the transformation started running. The transformer ignores its input. It copies the date and time from the VarSystem variable, and it formats the output according to your specification.
Table 9-58. Basic Properties
Property  Description
format    The format of the date and time. You can type the format or browse to a data holder that contains the format. For more information about the supported formats, see the DateFormatICU transformer.

Property  Description
optional  For more information about these properties, see Standard Transformer Properties on page 107.
TransformByParser
This transformer runs a parser on its input text. The parser must contain FindReplaceAnchor components that mark segments of the text for replacement. When the parser completes execution, the transformer performs the replacements. The transformer output is the modified text. Data Transformation ignores any XML output that the parser generates. For more information about this transformer, see FindReplaceAnchor on page 90.
Table 9-60. Basic Properties
Property
parser
Description For more information about these properties, see Standard Transformer Properties on page 107.
Online Sample
For an online sample, open samples\Projects\TransformByParser\TransformByParser.cmw. The sample uses TransformByParser to replace every instance of the string ~NL~ with a carriage return followed by a linefeed.
To run the TransformByParser sample:
1. Set MyTransformByParser as the startup component.
2. Run the transformer.
3. At the prompt, select the source file Report.edi.
The transformer stores its output in Results\Transformation of Report.edi. You can compare the output with the source in Notepad.
TransformByProcessor
This transformer runs a document processor on its input. The output of the transformer is the output of the document processor. For more information, see Document Processors on page 23. For example, you can use the transformer to convert an Excel document to text by invoking the ExcelToTxt document processor. The input of the transformer must be in a valid Excel format.
Table 9-62. Basic Properties
Property
processor
Description For more information about these properties, see Standard Transformer Properties on page 107.
TransformByService
This transformer runs a Data Transformation service on its input. The output of the transformer is the output of the service. For more information, see Deploying Data Transformation Services on page 235. For example, if you use the transformer to invoke a parser service, the output of the transformer is an XML string. The transformer supports single-input services. Do not use it with a service that has multiple input ports.
Table 9-64. Basic Properties
Property
service_name
Description If selected, Data Transformation does not apply the input and output encodings that are defined in the service. For more information, see Encoding Properties on page 218. A list of initial values that Data Transformation should assign to variables defined in the service. In each element of the list, specify the name of a variable and its value. For more information, see Initializing Variables at Runtime on page 66. For more information about these properties, see Standard Transformer Properties on page 107.
parameters
disabled
optional
TransformerPipeline
This transformer applies a sequence of nested transformers to its input.
Table 9-66. Advanced Properties
Property  Description
name      For more information about these properties, see Standard Transformer Properties on page 107.
WestEuroUniToAscii
This transformer converts text in western-European languages from Unicode UTF-16 to the Windows-1252 code page. The transformer is supported for backwards compatibility with projects that have been upgraded from previous Data Transformation versions. It is not available for use in new projects. Instead, set the encoding in the project properties. For more information, see Encoding Properties on page 218.
XSLTTransformer
This transformer applies an XSLT transformation to XML input text. For example, you might use a parser to extract data from an XML document. A Content anchor retrieves a complete, well-formed branch of the XML tree. You can configure the Content anchor with an XSLTTransformer that runs an XSLT transformation on the branch.
Table 9-67. Advanced Properties
Property   Description
xslt_file  The path and filename of the XSLT file.
name       For more information about these properties, see Standard Transformer Properties on page 107.
InlineTable
This component lets you define a lookup table in the IntelliScript. The table is used by the LookupTransformer. For example, you might specify:
Key  Value
1    George Washington
2    John Adams
3    Thomas Jefferson
4    James Madison
Description Under this property, enter a sequence of entry components. In each entry, specify key and value strings.
ODBC_Text_Connection
The subcomponent defines a database connection. It is used, for example, in the ODBCLookup transformer. Before using this component, use the operating system tools to define a DSN for the database connection.
Table 9-70. Basic Properties
Property
DSN
Description User name for the database connection. Password of the user. Time in seconds to wait for the database response.
XMLLookupTable
This component lets you specify an XML file that contains a lookup table. The table is used by the LookupTransformer. Prepare an XML file conforming with the schema lookupTableDefinition.xsd. The schema is stored in the doc subdirectory of the Data Transformation installation directory. The following XML document is an example:
<?xml version="1.0" encoding="windows-1252" ?>
<lt:LookupTable xmlns:lt="http://www.Itemfield.com/Engine/V4/lookupTable" matchCase="false">
    <lt:Entry key="1" value="George Washington" />
    <lt:Entry key="2" value="John Adams" />
</lt:LookupTable>
If the optional matchCase attribute is true, the key attribute is considered to be case-sensitive.
Table 9-72. Basic Properties
Property
xml_file_name
CHAPTER 10
Actions
This chapter includes the following topics:
Overview, 137
Standard Action Properties, 138
Action Quick Reference, 139
Action Component Reference, 139
Action Subcomponent Reference, 161
Overview
Actions are components that perform operations on data that Data Transformation has extracted from a source document. Some examples of the supported actions are:
- Arithmetic computations
- String concatenations
- Submitting forms to a web server
- Activating a secondary parser, serializer, or mapper
- Querying a database
You can use the out-of-the-box actions supplied with Data Transformation, or you can define custom actions. This chapter explains how to use actions and documents the actions that are available in Data Transformation.
An action can have additional effects, such as writing to a file, updating a database, or submitting data to an external application.
Defining Actions
You can define actions by editing the IntelliScript. You can insert the actions under the contains line of components such as a Parser, Serializer, Mapper, Group, or RepeatingGroup. Essentially, you can insert actions in any location where you can insert anchors, serialization anchors, or mapper anchors. The actions run in sequence with the anchors that you specify in the same location. In a parser, you can set the phase property of an action, which controls whether it runs in the initial, main, or final stage of the parsing procedure. For more information, see Search Phases on page 77.
Property  Description
name      A name that you assign to the action. Data Transformation displays the name in the event log. This can help you find an event that was caused by the particular action.
remark    A comment describing the action.
disabled  If selected, Data Transformation ignores the action. This is useful for testing and debugging, or for making minor modifications in a project without deleting the existing actions.
optional  By default, if an action fails, its parent component fails. If you select the optional property, the parent component does not fail. For more information, see Failure Handling on page 231.
phase     The processing phase during which Data Transformation should execute the action: initial, main, or final. This property has an effect only if the action is used in a parser.
Description
Writes a message in the event log.
Concatenates a list of strings stored in a multiple-occurrence data holder.
Concatenates strings.
Performs a computation defined in a JavaScript expression.
Generates all possible concatenations from multiple-occurrence data holders.
Fills a multiple-occurrence data holder with specified data.
Writes a custom log message.
Increments a date.
Computes the difference between two dates.
Downloads a file.
Downloads the content of a file into a data holder.
A debugging tool for dumping extracted data.
Evaluates a JavaScript expression. If the expression is false, the action fails.
Deletes values from a multiple-occurrence data holder.
Runs a custom action that is implemented as an ActiveX DLL.
Runs a JavaScript function.
Copies a data holder, optionally running transformers on the value.
Runs a database query.
Resets the list of visited pages, permitting repeat visits to a page.
Runs a mapper.
Runs a parser.
Runs a serializer.
Fills a data holder with predefined content.
Sorts the occurrences of a multiple-occurrence data holder.
Submits an HTML form using the Post method and parses the response.
Submits an HTML form using the Get method and parses the response.
Writes a value to a location such as a file or a string-type data holder.
Runs an XSLT transformation.
AddEventAction
This action adds a message to the event log.
Table 10-1. Basic Properties
Property  Description
severity  The severity level of the message. The options are notification, warning, failure, or fatal error.
message   The message string.
Description For more information about these properties, see Standard Action Properties on page 138.
AppendListItems
The AppendListItems action concatenates the strings in a multiple-occurrence data holder. For more information about preparing the input for this action, see Mapping to Multiple-Occurrence Data Holders on page 73.
Example
A source document contains the following space-separated text:
H E L L O
When you parse the document, you want to remove the spaces and store the result in an XML element called Greeting. One way to do this is to create a multiple-occurrence variable called VarLetter. Create several Content anchors that retrieve the individual letters and store them in occurrences of VarLetter. Then, use the AppendListItems action to concatenate the occurrences of VarLetter and store the result in the Greeting element. The result is:
<Greeting>HELLO</Greeting>
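The effect of the action in this example can be sketched in standard JavaScript. This is only an illustration, not Studio or IntelliScript syntax: the array varLetter stands in for the occurrences of VarLetter, and the helper name appendListItems is hypothetical.

```javascript
// Illustration only: models the occurrences of VarLetter as an array.
// appendListItems is a hypothetical helper, not a Data Transformation API.
function appendListItems(occurrences) {
  // Concatenate the occurrences in order, with no separator.
  return occurrences.join("");
}

const varLetter = ["H", "E", "L", "L", "O"];
const greeting = appendListItems(varLetter);
console.log("<Greeting>" + greeting + "</Greeting>");
```

The sketch assumes that the occurrences are concatenated in their stored order.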
Description The multiple-occurrence data holder. The data holder must have a simple XSD type. A data holder to store the output. The data holder must have a simple XSD type.
Description For more information about these properties, see Standard Action Properties on page 138.
Online Sample
For an online sample of this action, open the project samples\Projects\AppendListItems\AppendListItems.cmw. The sample uses a RepeatingGroup to store values in a multiple-occurrence variable. It then uses an AppendListItems action to concatenate the values.
AppendValues
The AppendValues action concatenates strings.
Example
A parser has generated the following XML:
<Name>
  <First>Ron</First>
  <Last>Lehrer</Last>
</Name>
Description A list of data holders containing the values to be appended. The data holders must have simple XSD types. A data holder to store the output. The data holder must have a simple XSD type.
output
Description If selected, and one of the input data holders is missing, the action continues. If not selected, the action fails. For more information about these properties, see Standard Action Properties on page 138.
name
CalculateValue
The CalculateValue action performs a computation that is defined by a JavaScript-like expression. For example, you can use the action to compute a sum of numerical values or to concatenate string values.
Note: For more information about the JavaScript syntax that Data Transformation supports, see EnsureCondition.
For more information about how Data Transformation handles the precision of xs:decimal and xs:float values, see Precision of Numerical Data on page 58.
Example
A parser has generated the following XML:
<ItemOrdered>
  <Name>Gizmo</Name>
  <Quantity>100</Quantity>
  <Price>25</Price>
</ItemOrdered>
To compute the total price of the order, define the Quantity and Price elements as input parameters. Specify the JavaScript expression $1 * $2, and store the result in the Total element.
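As a rough sketch of how the $1 and $2 placeholders map to input parameters, the following standard JavaScript substitutes parameter values into an expression and evaluates it. It assumes that the two parameters are the Quantity and Price values from the XML above. The helper evaluateWithParams is hypothetical and does not reproduce the Data Transformation JavaScript processor.

```javascript
// Illustration only: evaluates an expression such as "$1 * $2" against
// a list of parameter values. evaluateWithParams is a hypothetical helper.
function evaluateWithParams(expression, params) {
  // Replace each placeholder $1 to $9 with a reference to args[n - 1].
  const body = expression.replace(
    /\$([1-9])/g,
    (match, n) => "args[" + (Number(n) - 1) + "]"
  );
  return new Function("args", "return (" + body + ");")(params);
}

const quantity = 100; // assumed: value of the Quantity element
const price = 25;     // assumed: value of the Price element
const total = evaluateWithParams("$1 * $2", [quantity, price]);
console.log(total);
```

The same helper also covers string concatenation, for example the expression $1 + " " + $2 applied to two name strings.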
Table 10-7. Basic Properties
Property      Description
params        Data holders that contain the input parameters.
expression    The JavaScript expression. Use $1, $2, ... $9 to represent the input parameters.
result        A data holder to store the output.
Description The behavior in the event of a failure. The options are: - Ignore. Continue the transformation. - HaltExecution. Stop the transformation. For more information about these properties, see Standard Action Properties on page 138.
Online Sample
For an online sample of this action, open the project samples\Projects\CalculateValue\CalculateValue.cmw. The sample retrieves three numbers from a source document and stores them in variables. It uses a CalculateValue action to compute a mathematical function of the numbers.
CombineValues
The CombineValues action generates all possible combinations from lists of strings that are stored in multiple-occurrence data holders. It concatenates the strings in each combination, generating an output list. The input of this action must include one or more multiple-occurrence data holders. Optionally, it may also include single-occurrence data holders. For more information, see Multiple-Occurrence Data Holders on page 67. The output is a multiple-occurrence data holder. Each occurrence of the data holder stores a combination.
Example
In a multiple-occurrence variable called VarDay, you have stored the list Monday, Tuesday. In a multiple-occurrence variable called VarTime, you have stored morning, afternoon. In a single-occurrence variable called VarSpace, you have stored a space character. Suppose you run CombineValues on VarDay, VarSpace, and VarTime, with an output data holder called DayTime. The output is:
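The combinations in this example can be sketched in standard JavaScript as a Cartesian product. This is an illustration only: the helper combineValues is hypothetical, a single-occurrence data holder is modeled as a one-element array, and the order of the occurrences that the real action produces is an assumption.

```javascript
// Illustration only: each input is an array of strings. A single-occurrence
// data holder is modeled as a one-element array.
// combineValues is a hypothetical helper, not a Data Transformation API.
function combineValues(inputs) {
  // Build the Cartesian product, concatenating one string from each input.
  return inputs.reduce(
    (partial, values) =>
      partial.flatMap((prefix) => values.map((value) => prefix + value)),
    [""]
  );
}

const varDay = ["Monday", "Tuesday"];
const varSpace = [" "];
const varTime = ["morning", "afternoon"];
const dayTime = combineValues([varDay, varSpace, varTime]);
console.log(dayTime);
```

With these inputs, the sketch yields four occurrences, one for each pairing of a day with a time.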
Description From a Schema view, select the data holders containing the input. Typically, at least one of the inputs should be a multiple-occurrence data holder. The data holders must have simple XSD types. A multiple-occurrence data holder where the action stores its output. The data holder must have a simple XSD type.
output
Description For more information about these properties, see Standard Action Properties on page 138.
Online Sample
For an online sample of this action, open the project samples\Projects\CombineValues\CombineValues.cmw. The sample retrieves lists of days, months, and years from a source document. It uses a CombineValues action to generate all possible dates from the lists.
CreateList
This action inserts data in a list. The output is a multiple-occurrence data holder containing the list. For more information, see Multiple-Occurrence Data Holders on page 67. Nested in this component, enter the data values.
Example
If the input data values are
Jack Jennie Larissa
</Name>
Description The multiple-occurrence data holder where the action should store the list. The data holder must have a simple XSD type.
Description For more information about these properties, see Standard Action Properties on page 138.
CustomLog
This component can be used as the value of the on_fail property. In the event of a failure, the CustomLog component runs a serializer that prepares a log message. The system writes the message to a specified location. For more information about the on_fail property, see Failure Handling on page 231.
Note: The MSMQ and COM output options (MSMQOutput and OutputCOM) are supported for compatibility with projects created in earlier Data Transformation versions. The options are being phased out of the Data Transformation system. Do not use them in new projects.
Table 10-13. Basic Properties
Property          Description
run_serializer    A serializer that prepares the log message. Define a serializer in this location, or enter the name of a globally defined serializer.

Description
The output location. The options include:
- MSMQOutput. Writes to an MSMQ queue.
- OutputDataHolder. Writes to a data holder.
- OutputFile. Writes to a file.
- ResultFile. Writes to the default results file of the transformation.
- OutputCOM. Uses a custom COM component to output the data. Do not select this option directly. Instead, select the display name of the custom COM component.
For more information about these options, see Action Subcomponent Reference on page 161. In addition, you can choose:
- OutputPort. The name of an AdditionalOutputPort where the data is written. For more information, see Ports on page 15.
- StandardErrorLog. Writes to the user log. For more information, see Failure Handling on page 231.
DateAddICU
This action increments a date.
Table 10-15. Basic Properties
Property        Description
input_format    The date format, for example, dd/MM/yy. You can type the format or browse to a data holder containing the format. If you omit the format, the system default is assumed. For more information, see DateFormatICU on page 114.
input_date      The date to be incremented. You can type the date or browse to a data holder containing the date.
num_of_days     The number of days to add. You can type a positive or negative integer or browse to a data holder containing the number.
output          The data holder to store the output date.
Description For more information about these properties, see Standard Action Properties on page 138.
Note: DateAddICU replaces the DateAdd component used in previous versions of Data Transformation. Existing transformations that use the DateAdd component continue to run without change.
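To illustrate the kind of computation that this action performs, the following standard JavaScript adds a positive or negative number of days to a date in the dd/MM/yy format. It is a sketch under stated assumptions, not the ICU-based implementation: the helper addDays is hypothetical, it handles only the dd/MM/yy pattern, and it assumes that a two-digit year yy means 20yy. It runs in a standard JavaScript engine and not in the Data Transformation JavaScript processor.

```javascript
// Illustration only: handles just the dd/MM/yy pattern named in the text.
// addDays is a hypothetical helper. Mapping a two-digit year to 20yy is an assumption.
function addDays(inputDate, numOfDays) {
  const [day, month, year] = inputDate.split("/").map(Number);
  // Work in UTC so that daylight-saving changes cannot shift the result.
  // Date.UTC normalizes out-of-range day values into the next or previous month.
  const date = new Date(Date.UTC(2000 + year, month - 1, day + numOfDays));
  const pad = (n) => String(n).padStart(2, "0");
  return (
    pad(date.getUTCDate()) + "/" +
    pad(date.getUTCMonth() + 1) + "/" +
    pad(date.getUTCFullYear() % 100)
  );
}

console.log(addDays("28/02/08", 2));  // crosses the end of February in a leap year
console.log(addDays("01/01/08", -1)); // a negative value moves the date backward
```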
DateDiffICU
This action computes the difference between two dates.
Table 10-17. Basic Properties
Property
date_format1 date_format2
Description The formats of the two dates, for example, dd/MM/yy. You can type the format, or you can browse to a data holder containing the format. If you omit the format, the system default is assumed. For more information, see DateFormatICU on page 114. The two dates. You can type the date or browse to a data holder containing the date. The data holder to store the difference, in days.
Description For more information about these properties, see Standard Action Properties on page 138.
Note: DateDiffICU replaces the DateDiff component used in previous versions of Data Transformation. Existing transformations that use the DateDiff component continue to run without change.
DownloadFile
Note: This component is provided for compatibility with projects created in earlier Data Transformation versions. It is being phased out of the Data Transformation system. Do not use it in new projects.
This action downloads a file to the local computer. The file path or URL is specified in a data holder, which a transformation might retrieve dynamically from a source document.
Table 10-19. Basic Properties
Property       Description
file_url       A data holder that stores the file path or URL.
target_path    The folder path to store the downloaded file. If you leave the property blank, the file is stored in the Results folder of the project.
Description A sequence of transformers that the action applies to the path or URL string before downloading. For more information about these properties, see Standard Action Properties on page 138.
Online Sample
For an online sample of this action, open the project samples\Projects\DownloadFile\DownloadFile.cmw. To run the sample, you must have an Internet connection. The sample retrieves the URL of a file. It then uses the DownloadFile action to download the file to the Results folder of the project.
DownloadFileToDataHolder
Note: This component is provided for compatibility with projects created in earlier Data Transformation versions. It is being phased out of the Data Transformation system. Do not use it in new projects.
This action downloads a file from a web server and stores its content in a data holder. If the file contains symbols such as < and >, the action converts them to XML entities such as &lt; and &gt;.
Table 10-21. Basic Properties
Property    Description
file_url    A data holder that stores the URL of the file.
output      The data holder to store the downloaded content.
Description For more information about these properties, see Standard Action Properties on page 138.
DumpValues
This action is a debugging tool. It writes data to a <DumpValues>...</DumpValues> element. Nested in the action, insert the data holders that should be dumped.
Table 10-23. Advanced Properties
Property    Description
output      The file in which to write the output. The options are:
            - ResultFile. The default output file of the project.
            - OutputFile. Specify a path.
For more information about these properties, see Standard Action Properties on page 138.
EnsureCondition
This action evaluates a JavaScript expression. If the expression is false, the action fails.
Table 10-24. Basic Properties
Property     Description
condition    A JavaScript expression to be evaluated. In the expression, use $1, $2, ... $9 to refer to the params. For example, the following expression checks whether the first parameter has the value Ron Lehrer:
             $1 == "Ron Lehrer"
params       A list of data holders containing parameters that you can use in the condition.
Description For more information about these properties, see Standard Action Properties on page 138.
JavaScript Syntax
The Data Transformation JavaScript processor supports standard JavaScript expressions containing the following features.
Note: Information about JavaScript syntax is available in many books about web development. For a tutorial
If you apply these methods to a literal having a simple data type, you must enclose the literal in parentheses, for example:
123.toString();                     //Wrong
(123).toString();                   //Right
"Hello, World".substring(3,7);      //Wrong
("Hello, World").substring(3,7);    //Right
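The parenthesized forms can be checked in any standard JavaScript engine. The following sketch shows the results that they produce there. It also shows that a standard engine rejects 123.toString() as a syntax error, because the dot is read as part of the number. Unlike the Data Transformation processor described above, a standard engine accepts the unparenthesized string form. The variable names are illustrative only.

```javascript
// The parenthesized forms recommended above, run in a standard JavaScript engine.
const asText = (123).toString();
const part = ("Hello, World").substring(3, 7);
console.log(asText);
console.log(JSON.stringify(part));

// In standard JavaScript, the unparenthesized numeric form does not even parse.
// new Function compiles the source text, so the error can be caught here.
let numericFormRejected = false;
try {
  new Function("return 123.toString();");
} catch (error) {
  numericFormRejected = error instanceof SyntaxError;
}
console.log(numericFormRejected);
```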
The JavaScript processor does not support features such as the following:
- Assignment operators: = += -= *= /= >>= >>>= <<= &= |= ^=
- The comma operator (,).
- The values NaN, null, infinity, or -0 (negative 0).
- Data types other than string, number, and boolean.
- The Date object.
- The equalsIgnoreCase function.
Note: Earlier, 32-bit Data Transformation versions included an external JavaScript processor that supported additional JavaScript features. The external processor does not run on 64-bit platforms, and it is no longer included in the Data Transformation setup. For compatibility with projects created in previous versions, running on 32-bit platforms, you can request a copy of the external processor from Informatica. Do not use the external processor in new projects.
ExcludeItems
This action deletes specified values from a multiple-occurrence data holder. For more information, see Multiple-Occurrence Data Holders on page 67. Nested in the action, specify the values to exclude.
Table 10-26. Basic Properties
Property       Description
data_holder    The multiple-occurrence data holder. The data holder must have a simple XSD type.
Description For more information about these properties, see Standard Action Properties on page 138.
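The deletion that this action performs can be pictured with a short standard JavaScript sketch. It is an illustration only: the helper excludeItems is hypothetical, the occurrences are modeled as an array, and exact, case-sensitive matching of the excluded values is an assumption.

```javascript
// Illustration only: models a multiple-occurrence data holder as an array.
// excludeItems is a hypothetical helper. Exact, case-sensitive matching is an assumption.
function excludeItems(occurrences, valuesToExclude) {
  const excluded = new Set(valuesToExclude);
  // Keep every occurrence whose value is not in the exclusion list.
  return occurrences.filter((value) => !excluded.has(value));
}

const remaining = excludeItems(
  ["Jack", "Jennie", "Larissa", "Jack"],
  ["Jack"]
);
console.log(remaining);
```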
ExternalCOMAction
Note: This component is being phased out of the Data Transformation system. For backwards compatibility, the Studio displays the component in existing projects that use it. It cannot be used in new projects. For more information about custom processors, see the Data Transformation Engine Developer Guide.
The ExternalCOMAction component runs a custom action. You can implement the custom action as a COM (ActiveX) DLL or as a .NET DLL with the COM interoperability feature. Because this component uses the Microsoft COM architecture to activate the custom action, it runs only on Microsoft Windows platforms.
To create a custom COM action:
1. In Microsoft Visual Studio, create a DLL project.
   - If you use Microsoft Visual Basic 6, create an ActiveX DLL project containing a class module.
   - If you use Microsoft Visual Studio .NET:
     - Create a class library project that references ICMAction.dll in the Data Transformation installation folder.
     - Create a class that implements the ICMAction interface.
     - Configure the project with the COM interop feature.
2. In the class, implement the Run method. In Visual Basic, the syntax of the method is:
   Function Run(ByVal inp As String, ByVal design_mode As Boolean) As String
   The inp parameter is the input string that the action should process. The ExternalCOMAction component passes the input string to the function and receives the return value as output. The function can have any desired side effects, such as interacting with a third-party system.
   The design_mode parameter is True if the action is activated within Data Transformation Studio. If the custom action requires a long processing time or has side effects that interfere while you are designing a parser, the function can perform different operations based on the design_mode value.
3. Register the DLL on the Data Transformation computer. If you use Visual Basic 6, use the regsvr32 command to register the component. If you use Visual Studio .NET, use the regasm command.
4. Define an ExternalCOMAction that references the ProgID of the DLL.
5. Optionally, add the ExternalCOMAction to the component list that Data Transformation Studio displays. For more information about customizing the component list, see Using Data Transformation Studio in Eclipse.
Description A COMClass component specifying the ProgID of the custom action. A data holder storing the input of the action. A data holder where the action should store its output.
Description For more information about these properties, see Standard Action Properties on page 138.
1. Open Visual Studio .NET and create a new C# Class Library project.
2. Add a reference to the file ICMAction.dll. In the Add Reference window, you can find the reference on the .NET tab, component name ICMAction. Alternatively, browse to ICMAction.dll in the Data Transformation installation folder.
3. Add a class that implements the ICMAction interface. You can copy the following sample code. Change the namespace and class names, CMActionExample and CCMActionExample, to meaningful names for your project.

   using System;
   using System.Runtime.InteropServices; //Enables COM interop
   using Itemfield.ContentMaster; //For ICMAction interface

   namespace CMActionExample
   {
       //Prevents automatic creation of class interface.
       //Causes class to be exported to COM only as an implementor
       //of the ICMAction interface
       [ClassInterface(ClassInterfaceType.None)]
       public class CCMActionExample : ICMAction
       {
           public CCMActionExample()
           {
           }

           public string Run(string inp, bool design_mode)
           {
               //ToDo: Insert code here
           }
       }
   }

4. Implement the Run function, inserting code that performs the desired action. For example, the following implementation causes the custom action to count the characters in the input and return the result.

   public string Run(string inp, bool design_mode)
   {
       Int32 res = inp.Length;
       return res.ToString();
   }

5. In the Solution Explorer, right-click the project and edit its properties. In the left pane of the properties window, expand the tree and select Configuration Properties / Build. In the right pane, in the Outputs section, set the Register for COM Interop property to true.
6. Right-click the project and click Build. This generates the DLL file.
On the computer where you developed the .NET project, Visual Studio .NET registers the DLL when you build the project. The DLL is ready to use in the ExternalCOMAction component. To run the custom action in Data Transformation on another computer, you must install the custom DLL as follows:
To install the DLL on another computer:
1. Confirm that Microsoft .NET Framework, version 1.1 or higher, is installed on the computer.
2. Copy the custom DLL to any convenient location on the computer, such as the Data Transformation program folder.
3. Open a command prompt, and use the regasm utility to register the DLL. The utility is located in the Windows folder, in the subfolder Microsoft.NET\Framework\<version>. For example, enter the following command:
   regasm <path>\YourCustomDLL.dll /codebase
   The regasm utility displays a message indicating that the DLL was successfully registered.
Online Sample
For an online sample of a Visual Studio .NET project that implements a custom action in the C# language, see the following location in the Data Transformation installation folder:
samples\SDK\CMACTION
JavaScriptFunction
This action executes a JavaScript function, for example, a function located in an HTML source document. You can pass parameters to the function, and you can store the return value of the function.
Note: This action is being phased out of the Data Transformation system. For full JavaScript support, it requires an external JavaScript processor that was supplied with earlier, 32-bit Data Transformation versions. The external processor does not run on 64-bit platforms, and it is no longer included in the Data Transformation setup. For compatibility with projects created in previous versions, running on 32-bit platforms, you can request a copy of the external processor from Informatica. Do not use JavaScriptFunction in new projects.
Table 10-30. Basic Properties
Property               Description
function_to_execute    The name of the function.
result                 A data holder in which to store the return value of the function.
params                 A list of data holders containing the input parameters of the function. The parameters must be in the same order as in the function declaration.
Description If selected, Data Transformation recompiles the function for each page that a parser processes. If not selected, Data Transformation assumes that the function is the same on all the pages, and it compiles the function only on the first page. For more information about these properties, see Standard Action Properties on page 138.
name
Map
This action copies a value from one data holder to another. When copying a data holder that has a simple XSD data type, the source and destination must have compatible data types. The action can apply transformers to the copied value. If you use the action to copy a multiple-occurrence data holder that has a simple type, and the action is not located within an iterating component such as a RepeatingGroup, the action copies all the occurrences of the data holder. If you use the action to copy a data holder that has a complex type, the source and destination must have identical internal structures and identical XSD types. The action copies the nested elements and attributes.
Table 10-32. Basic Properties
Property        Description
source          The source data holder.
target          The destination data holder.
transformers    A sequence of transformers that modify the value. Do not assign this property if the source and destination are complex XML elements.
Description For more information about these properties, see Standard Action Properties on page 138.
Online Sample
For an online sample of this action, open the project samples\Projects\CopyValue\CopyValue.cmw. The sample uses a Map action to copy a complex element that contains an attribute and nested elements.
ODBCAction
This action runs a SQL query on a database. For example, it can perform a SELECT query that retrieves data, or it can perform an INSERT or UPDATE query that adds data to the database.
Example
A source document contains an employee ID number. A parser retrieves the ID and stores it in a variable called EmpID. You want to retrieve the employee's name from a database and store the result in the following XML structure:
<Person>
  <Name>
    <First>...</First>
    <Last>...</Last>
  </Name>
</Person>
In this example:
- The db_connection property defines the database connection.
- The output_record defines the data holder where the action should store the retrieved data.
- The sql_statement is the SQL query that retrieves the data.
- The input_parameters property contains the EmpID variable, which is the input of the action.
Property         Description
db_connection    An ODBC_XML_Connection component defining the ODBC provider, which is typically a database.
sql_statement    The SQL query, for example:
                 SELECT Name FROM Employees WHERE Id = ?
                 Use the ? symbol to represent an input parameter. If there is more than one input parameter, each ? symbol represents the next parameter in sequence, for example:
                 SELECT Name FROM Employees WHERE Id = ? AND Gender = ?
                 In this case, the two ? symbols represent the first and second input parameters, respectively. The SQL syntax must be valid for the ODBC provider. See the provider or database documentation for details.
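The positional pairing of ? symbols with input parameters can be sketched as follows in standard JavaScript. This illustrates the ordering rule only and does not connect to a database. The helper bindInOrder and the sample parameter values are hypothetical, and the placeholder count is naive because it would also count a ? inside a quoted SQL string.

```javascript
// Illustration only: pairs each ? placeholder with an input parameter by position.
// bindInOrder is a hypothetical helper, not part of ODBCAction or any ODBC driver.
// The count is naive: it would also count a ? that appears inside a quoted SQL string.
function bindInOrder(sqlStatement, inputParameters) {
  const placeholderCount = (sqlStatement.match(/\?/g) || []).length;
  if (placeholderCount !== inputParameters.length) {
    throw new Error(
      "Expected " + placeholderCount + " input parameters, got " + inputParameters.length
    );
  }
  // The first ? receives the first parameter, the second ? the second, and so on.
  return inputParameters.map((value, index) => ({ placeholder: index + 1, value }));
}

const bindings = bindInOrder(
  "SELECT Name FROM Employees WHERE Id = ? AND Gender = ?",
  ["1234", "F"] // hypothetical values for the first and second input parameters
);
console.log(bindings);
```

Checking the counts before running a query catches a mismatch between the sql_statement and the input_parameters list early.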
Description
The behavior if the SQL query does not retrieve any data. The value can be:
- Success. The action does not fail.
- Fail. The action fails.

Property            Description
output_record       An XML element, defined in the XSD schema, where the action should store any data that the SQL query retrieves. The element must contain nested elements, at the top level of nesting, whose names are identical to the output fields of the query. If the SQL query retrieves multiple records, the schema must permit multiple occurrences of the XML element. For more information, see Multiple-Occurrence Data Holders on page 67.
retry               The number of retries if the first connection attempt fails.
input_parameters    A list of data holders that contain the input parameters.
Description For more information about these properties, see Standard Action Properties on page 138.
ResetVisitedPages
This action clears the list of visited pages of specified secondary parsers. This action is used with the reject_recurring_pages property of a Parser component. ResetVisitedPages allows multiple visits to the same page, even if reject_recurring_pages is selected. You might do this, for example, if you want to post different input data to the same web page.
Table 10-36. Basic Properties
Property
parsers
Description For more information about these properties, see Standard Action Properties on page 138.
RunMapper
This action runs a mapper. For example, you can use this action in a parser to run a mapper that modifies the parser output.
Table 10-38. Basic Properties
Property    Description
mapper      The mapper. You can select the name of an existing Mapper component, or you can create a Mapper component at this location of the IntelliScript. For more information, see Mappers on page 183.
input       A data holder storing XML text on which to run the mapper. The data holder must have a simple data type such as xs:string. The value of the string can be XML text of any complexity. For more information about how to run a mapper on a data holder that has a complex type, see EmbeddedMapper on page 188. If you omit this property, the mapper uses the data holders available in the scope of the action. For example, if the action is nested in a parser, the mapper runs on the output of the parser. If the action is within a Group, it runs on the output of the Group.
Description For more information about these properties, see Standard Action Properties on page 138.
RunParser
This action runs a parser. In a parser, for example, you can use this action to follow the links in an HTML file and run a secondary parser on the link destinations. In a serializer, you can use the action to parse bits of unstructured data that exist in the input. The output of RunParser is appended to the output of the main component that activated it, such as a parser or serializer. The RunParser action differs from the EmbeddedParser anchor, in that RunParser parses a new source, whereas EmbeddedParser parses a section of an existing source.
Example
An HTML file has a link to a second file. A Content anchor stores the file path of the link destination in the VarLinkURL system variable. The RunParser action accesses the destination file and runs a secondary parser on it.
In another example, the main parser contains an Alternatives anchor that selects a secondary parser according to text in the source document. For more information, see Alternatives on page 83.
Table 10-40. Basic Properties
Property       Description
next_parser    The name of the parser to run. Recursive calls to the same parser are permitted.
Description This property specifies the type of data that the input_source data holder contains. If input_source_as_text is selected, input_source contains a text string that should be parsed. If not selected, input_source contains a file path. If input_source_as_text is selected, input_source is a data holder that contains a string to be parsed. If input_source_as_text is not selected, input_source is a data holder containing the path of the document to be parsed. The default value is the VarLinkURL system variable. If the VarPostData system variable contains a value, the value is posted to the URL. If VarPostData is empty, the action accesses the URL without posting any data. A document processor that the parser should apply to the source. The number of times to retry if the request fails. The interval in seconds between retries. Strings that must be present in the input_source value. If a specified string is not present, the action does not access the source or activate the secondary parser.
input_source
Description Strings that must not be present in the input_source. If a string is present, the action does not access the source or activate the secondary parser. For more information about these properties, see Standard Action Properties on page 138.
name
Transformation versions. It is being phased out of the Data Transformation system. Do not use it in new projects.
Optionally, you can use the action to post data to a URL. This feature simulates the submission of an HTML form to a web server. The action activates a parser that processes the result returned by the web server. To do this, you must store the data to be posted in the VarPostData system variable.
To prepare the data in VarPostData:
1. Save a copy of the HTML page containing the form on your local computer.
2. Edit the copy, changing the form action attribute to your email address. For example, if the form element reads <form method="POST" action="http://example.com/MyServer.exe">, change it to <form method="POST" action="mailto:jdoe@example.com">.
3. Open the copy in your browser, fill in the form, and click the submit button. This sends an email containing the form data to your address.
4. The body of the email is a string containing the form data. Assign this string to the VarPostData variable.
Note: Alternatively, you can use the SubmitForm or SubmitFormGet action to submit HTML form data to a URL.
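For orientation, the following standard JavaScript sketch builds the kind of string that a browser typically produces for a form with the default URL-encoded content type. That the emailed string has exactly this shape is an assumption, and the field names and values are hypothetical. The mailto procedure above remains the way to confirm the format that a particular form requires.

```javascript
// Illustration only: builds a URL-encoded string of name=value pairs.
// The field names and values are hypothetical. The assumption is that the form
// uses the default application/x-www-form-urlencoded content type.
const formFields = { Name: "Ron Lehrer", City: "Tel Aviv" };

// URLSearchParams is a standard global in browsers and in Node.js.
// It joins the pairs with & and encodes each space as +.
const postData = new URLSearchParams(formFields).toString();
console.log(postData);
```

A string of this shape is what the procedure above assigns to the VarPostData variable.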
RunSerializer
This action runs a serializer. The output of the serializer is stored in a data holder. For example, a parser can use this action to run a serializer that modifies the parser output.
Table 10-42. Basic Properties
Property      Description
serializer    The serializer. You can select the name of an existing Serializer component, or you can create a Serializer at this location of the IntelliScript. For more information, see Serializers on page 167.
input         A data holder storing XML text on which to run the serializer. The data holder must have a simple data type such as xs:string. The value of the string can be XML text of any complexity. For more information about how to run a serializer on a data holder that has a complex type, see EmbeddedSerializer on page 178. If you omit this property, the serializer uses the data holders available in the scope of the action. For example, if the action is nested in a parser, the serializer runs on the output of the parser. If the action is within a Group, it runs on the output of the Group.
output        A data holder to store the serializer output.
Description For more information about these properties, see Standard Action Properties on page 138.
Online Sample
For an online sample of this action, open the project samples\Projects\RunSerializer\RunSerializer.cmw. To observe how the sample works, set MainParser as the startup component and run it. MainParser contains a RepeatingGroup that parses pairs of names and stores them in variables. After each iteration, the RepeatingGroup executes a RunSerializer action that concatenates the variables with some predefined text. The action stores its output in an XML element that is added to the parser output.
SetValue
This action fills a data holder with predefined content. The assignment overwrites any existing content, except for a multiple-occurrence data holder. For more information, see Multiple-Occurrence Data Holders on page 67.
Table 10-44. Basic Properties
Property
quote data_holder
Description A list of transformers that are applied to the content. For more information about these properties, see Standard Action Properties on page 138.
Sort
This action sorts the occurrences of a multiple-occurrence data holder. The output is saved to the original data holder. For more information, see Multiple-Occurrence Data Holders on page 67. You can sort any multiple-occurrence data holder in the scope of the project, for example:
- The output of a parser
- The input of a serializer
- The input or output of a mapper
- A variable
If you run the action on an XML element that contains attributes or nested elements, you can use them as sort keys.
Limitation
You cannot use the Sort action if a Key is defined on the multiple-occurrence data holder. For more information, see Locators, Keys, and Indexing on page 191.
Table 10-46. Basic Properties
Property             Description
recurring_element    The multiple-occurrence data holder that should be sorted.
by_fields            The sort keys, in decreasing order of precedence. For each field, select the data holder and an ascending or descending sort. You can select the multiple-occurrence data holder itself, or any of its nested elements or attributes. To sort numerically, a sort key must have a numerical XSD type such as xs:integer.
Description For more information about these properties, see Standard Action Properties on page 138.
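The idea of sort keys in decreasing order of precedence can be sketched in standard JavaScript. This is an illustration only: the helper sortOccurrences, the sample records, and the field names are hypothetical. Unlike the Sort action, which saves its output to the original data holder, the sketch returns a sorted copy.

```javascript
// Illustration only: sorts records by a list of keys in decreasing order of
// precedence. sortOccurrences and the sample data are hypothetical.
function sortOccurrences(occurrences, byFields) {
  // Copy the array so that the input is left unchanged.
  return [...occurrences].sort((a, b) => {
    for (const { field, descending } of byFields) {
      // Numbers compare numerically and strings compare as text.
      if (a[field] < b[field]) return descending ? 1 : -1;
      if (a[field] > b[field]) return descending ? -1 : 1;
      // On a tie, fall through to the next key.
    }
    return 0;
  });
}

const people = [
  { last: "Lehrer", first: "Ron", age: 35 },
  { last: "Adams", first: "Zoe", age: 41 },
  { last: "Lehrer", first: "Ann", age: 29 },
];
const byName = sortOccurrences(people, [
  { field: "last", descending: false },
  { field: "first", descending: false },
]);
console.log(byName.map((p) => p.last + ", " + p.first));
```

The second key matters only when two occurrences have equal values for the first key, which is what precedence means here.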
SubmitForm
Note: This component is provided for compatibility with projects created in earlier Data Transformation versions. It is being phased out of the Data Transformation system. Do not use it in new projects.
This action submits HTML form data to a URL and parses the response.
SubmitForm uses the HTTP Post method to submit the form. To use the HTTP Get method, use the SubmitFormGet action instead. See also the RunParser action, which can submit an HTML form.
The output of SubmitForm is appended to the output of the main component that activated it, such as a parser or serializer.
SubmitForm is an alternative to using the HtmlForm anchor. HtmlForm is easier to use because it performs some of the data-preparation steps automatically. SubmitForm gives you greater control because it lets you configure these steps yourself.
To use the SubmitForm action:
1. Store the URL to which you want to submit the form in the VarFormAction system variable. The URL corresponds to the action attribute of an HTML <form> element.
2. Store the form data in the VarFormData system variable. You can determine the correct format of the data in the following way:
- Save a copy of the HTML page containing the form on your local computer.
- Edit the copy, changing the form action attribute to your email address. For example, if the form element reads <form method="POST" action="http://example.com/MyServer.exe">, change it to <form method="POST" action="mailto:jdoe@example.com">.
- Open the copy in your browser, fill in the form, and click the submit button. This sends an email containing the form data to your address. The body of the email is a string containing the form data.
- Assign this string to the VarFormData variable.
3. Run the SubmitForm action. The action submits the data that you stored in VarFormData to the location that you stored in VarFormAction.
VarFormData is a multiple-occurrence variable. This means that you can create multiple occurrences of VarFormData, each storing a different set of post data.
If you do this, SubmitForm posts each occurrence of VarFormData independently, and it parses each of the webserver responses. You can use the CombineValues action to prepare the VarFormData occurrences. For example, if you know the possible values of each form field, CombineValues can prepare all possible combinations of the values.
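The idea of preparing one post string per combination of field values can be sketched in Python. This is an illustration only, not the CombineValues or SubmitForm implementation. It assumes that the form uses the common URL-encoded name=value&name=value format, and the field names and values (flower, price, and so on) are hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def form_data_combinations(field_options):
    """Return one URL-encoded post string per combination of field values.

    field_options maps each form field name to the list of its possible values.
    """
    names = list(field_options)
    value_lists = [field_options[name] for name in names]
    return [
        urlencode(dict(zip(names, values)))
        for values in product(*value_lists)
    ]


# Hypothetical form fields and options.
options = {"flower": ["rose", "tulip"], "price": ["low", "high"]}
post_bodies = form_data_combinations(options)

for body in post_bodies:
    print(body)
```

Each string in post_bodies plays the role of one occurrence of VarFormData. Compare the strings with the email body obtained by the mailto procedure above before relying on this format for a particular form.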
Table 10-48. Basic Properties
Property
action
Description An OpenURL component specifying how to parse the web-server response. For more information, see OpenURL on page 162.
Description For more information about these properties, see Standard Action Properties on page 138.
Online Sample
For an online sample of this action, open the project samples\Projects\SubmitForm\SubmitForm.cmw. The sample works in the following way:
- The main parser, Flower_form_parser, retrieves options from the HTML order form of an online florist. The options include several flower types and price ranges.
- The parser uses a CombineValues action to prepare all possible combinations of the flower-type and price-range options.
- The parser uses a SubmitForm action to post the combinations to a web application.
- The SubmitForm action activates a secondary parser that parses the responses from the web application. The parsing output is added to the output of the main parser.
Note: You cannot run this sample because the web application does not exist.
SubmitFormGet
Note: This component is provided for compatibility with projects created in earlier Data Transformation versions. It is being phased out of the Data Transformation system. Do not use it in new projects.
This action submits HTML form data to a URL and parses the response.
SubmitFormGet is identical to SubmitForm, except that it uses the HTTP Get method instead of Post. For more information, see SubmitForm on page 158.
WriteValue
This action writes the value of a data holder to a location such as a file or to a string-type data holder.
If the input data holder is an XML element, the action writes both the element and any nested elements and attributes.
Note: The MSMQ and COM output options ( MSMQOutput and OutputCOM) are supported for compatibility with projects created in earlier Data Transformation versions. The options are being phased out of the Data Transformation system. Do not use them in new projects.
Table 10-50. Basic Properties
Property / Description
- input. The data holder to write.
- output. The output location. The options include:
  - MSMQOutput. Writes to an MSMQ queue.
  - OutputDataHolder. Writes to a data holder.
  - OutputFile. Writes to a file.
  - ResultFile. Writes to the default results file of the transformation.
  - OutputCOM. Uses a custom COM component to output the data. Do not select this option directly. Instead, select the display name of the custom COM component.
  For more information about these options, see Action Subcomponent Reference on page 161. In addition, you can choose:
  - OutputPort. The name of an AdditionalOutputPort where the data is written. For more information, see Ports on page 15.
  - StandardErrorLog. Writes to the user log. For more information, see Failure Handling on page 231.
Property / Description
- no_tags. By default, the action surrounds the value that it writes with XML tags. If you select no_tags, the XML tags are omitted. This is appropriate only if input is a simple data holder, containing no nested elements or attributes.
- transformers. A list of transformers that modify the value before writing. The input to the transformers is the complete input data holder, including XML tags.
For more information about these properties, see Standard Action Properties on page 138.
Online Samples
For an online sample of this action, open the project samples\Projects\Splitter\Splitter.cmw. The sample demonstrates how to split a file into two files. A parser uses a RepeatingGroup to retrieve the records of an HL7 file. It uses a Map action to create unique filenames for each record, and a WriteValue action to write the records to the files. The output files, MyOutput1.txt and MyOutput2.txt, are stored in the Results folder of the project.
Note: Alternatively, you can use a streamer to split large inputs. For more information, see Streamers on page 207.
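The general pattern of the Splitter sample (retrieve records, build a unique filename for each, write each record to its file) can be sketched in plain Python. This is not the logic of the sample project. It assumes that each record begins with a line starting with MSH|, and the function names split_records and write_records are hypothetical.

```python
from pathlib import Path


def split_records(text, start_marker="MSH|"):
    """Split text into records, each beginning at a line that starts with start_marker."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(start_marker) and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def write_records(records, folder, pattern="MyOutput{}.txt"):
    """Write each record to its own numbered file and return the file paths."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, record in enumerate(records, start=1):
        path = folder / pattern.format(number)
        path.write_text(record, encoding="utf-8")
        paths.append(path)
    return paths
```

For example, write_records(split_records(text), "Results") writes MyOutput1.txt, MyOutput2.txt, and so on to a Results folder, one file per record.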
XSLTMap
This action runs an XSLT transformation. The input and output are branches of an XML document. For example, they can be the output of a parser or the input of a serializer.
Example
Suppose that the following XML is the result of a parser:
<Person> <First>Ron</First> <Last>Lehrer</Last> </Person>
You can use the XSLTMap action, with an appropriate XSLT file, to convert this to:
<Person Name="Lehrer, Ron" />
Description The XML element at the root of the branch to be transformed. The XML element at the root of the branch that should store the output. Browse to the XSLT file.
Description For more information about these properties, see Standard Action Properties on page 138.
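To make the restructuring in the example concrete, the following Python sketch performs the same conversion directly with the standard library. It is not an XSLT stylesheet and does not use the XSLTMap action; an actual project would supply an XSLT file that expresses the equivalent rule. The function name person_to_attribute_form is hypothetical.

```python
import xml.etree.ElementTree as ET


def person_to_attribute_form(person):
    """Build <Person Name="Last, First"/> from a Person element with First and Last children."""
    first = person.findtext("First", default="")
    last = person.findtext("Last", default="")
    return ET.Element("Person", {"Name": f"{last}, {first}"})


source = ET.fromstring(
    "<Person><First>Ron</First><Last>Lehrer</Last></Person>"
)
result = person_to_attribute_form(source)
print(ET.tostring(result, encoding="unicode"))
```

The input branch is left unchanged and a new output branch is created, which mirrors the way the action reads from one XML branch and stores its result in another.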
COMClass
This subcomponent is used in ExternalCOMAction to define a custom COM component.
Table 10-54. Basic Properties
Property
ProgID
Description The ProgID of the COM component. If you developed the custom action in Visual Basic 6, the ProgID typically has the form dll_name.class. If you used Visual Studio .NET with the COM interoperability option, the ProgID has the form namespace.class. For example, if you developed a .NET namespace with the name CMActionExample, and the class name is CCMActionExample, the ProgID is CMActionExample.CCMActionExample.
Description Deselect this option if the COM component is incompatible with multithreading. This causes Data Transformation to synchronize calls to the component.
MSMQOutput
Note: This subcomponent is provided for compatibility with projects created in earlier Data Transformation versions. It is being phased out of the Data Transformation system. Do not use it in new projects.
This subcomponent specifies that a stream should be written to an MSMQ message and sent to a queue. The subcomponent is used in the WriteValue action to specify the output location.
Table 10-56. Basic Properties
Property
output_id
Description The MSMQ queue identifier, such as path. You can type the identifier or browse to a data holder that contains the identifier.
Description This property is not in use. For more information about these properties, see Standard Action Properties on page 138.
ODBC_XML_Connection
This subcomponent defines a database connection. It is used, for example, in an ODBCAction. Before using this subcomponent, use the operating system tools to define a DSN for the database connection.
Table 10-58. Advanced Properties
Property / Description
- DSN. The data source name of the connection.
- username. User name for the database connection.
- password. Password of the user.
- timeout. Time in seconds to wait for the database response.
OpenURL
This subcomponent is used within the SubmitForm and SubmitFormGet actions to specify how to parse a webserver response.
Table 10-59. Basic Properties
Property
next_parser
Description The number of retries if the first request fails. The interval in seconds between retries. For more information about these properties, see Standard Action Properties on page 138.
Description
OutputCOM
Note: This subcomponent is provided for compatibility with projects created in earlier Data Transformation versions. It is being phased out of the Data Transformation system. Do not use it in new projects.
The OutputCOM option of the WriteValue action allows you to use a custom COM component to create output from Data Transformation. The component can perform any desired operations, such as modifying the data, writing the data to multiple locations, or interacting with an information system. Because OutputCOM uses the Microsoft COM technology, it operates only on Microsoft Windows systems.
The OutputCOM component works differently from most other Data Transformation components. It is a template for a custom component, and not a component that you can use directly. You cannot configure an action with the OutputCOM option and nest a custom component within OutputCOM. Instead, you must program a custom COM component, add it to the drop-down list, and select its name in the output property of WriteValue. The following paragraphs explain the procedure.
Description An identifier for the desired output location. The value of this parameter is the value of the output_id property of the OutputCOM component. The content that the action is outputting. If the append property of the component is not selected in the IntelliScript, mode = "CREATE". If the append property is selected, mode = "APPEND".
outContent mode
The function can perform any desired operations. Data Transformation ignores the return value of the function. Install and register the DLL on the Data Transformation computer.
In Notepad, create a text file. Type a line such as the following in the file:
profile DisplayName ofPT OutputCOMT("MyProject.MyClass")
Here, DisplayName is the name that you would like to display in the drop-down list, and MyProject.MyClass is the ProgID of the component.
3. Save the file with an extension of *.tgp, for example, MyClass.tgp, in the program subdirectory DataTransformation\AutoInclude\User.
Configure an action. In the output property of the action, select the DisplayName that you configured above. Assign the output_id and append properties of the DisplayName component. You can then run the transformation. The action activates the custom component.
Description An identifier for the location where the custom COM component should write its output. You can use this parameter, for example, to pass the name of an output file to the custom component. You can type an identifier, or you can browse to a data holder that contains the identifier.
Description If selected, the custom COM component should append its output to the existing content of the output location rather than overwriting it. For more information about these properties, see Standard Action Properties on page 138.
name remark
OutputDataHolder
This subcomponent specifies how to write a stream to a data holder. The subcomponent is used in the WriteValue action to specify the output location.
Table 10-63. Basic Properties
Property
data_holder
Description A sequence of transformers that modify the stream before writing. For more information about these properties, see Standard Action Properties on page 138.
OutputFile
This subcomponent specifies that a stream should be written to a file.
The subcomponent is used in the DumpValues and WriteValue actions to specify the output location.
Table 10-65. Basic Properties
Property
file
Description The filename, optionally including a path. You can type the name, or you can browse to a data holder that contains the name. The path can be absolute or relative. In the latter case, Data Transformation resolves the path relative to the output folder of the transformation. If you run the transformation within Data Transformation Studio, the path is relative to the project Results folder.
Description If selected, the data is appended to the existing content of the file, rather than overwriting it. For more information about these properties, see Standard Action Properties on page 138.
ResultFile
This subcomponent specifies that a stream should be written to the normal output file of a project. The subcomponent is used in the DumpValues and WriteValue actions to specify the output location.
CHAPTER 11
Serializers
This chapter includes the following topics:
- Creating a Serializer, 167
- Running a Serializer, 172
- Serialization Anchors, 172
- Standard Serializer Properties, 174
- Serializer Quick Reference, 174
- Serializer Component Reference, 174
- Serialization Anchor Component Reference, 175
Creating a Serializer
Serialization is the opposite of parsing. A parser converts a source document from any format to an XML file. A serializer converts an XML file to an output document in any format. For example, the output of a serializer can be a text document, an HTML document, or even another XML document. You can create a serializer by any of the following methods:
By inverting the configuration of an existing parser By using the New Serializer wizard By editing the IntelliScript and inserting a Serializer component
You can combine these methods. For example, you can invert a parser and edit the IntelliScript of the resulting serializer. It is usually easier to create a serializer than a parser. This is because the XML input is completely structured. The structure makes it easy to identify the required data and write it, in a sequential procedure, to the output. A parser, in contrast, may need to process unstructured or semi-structured input, a task that is often more complex than serialization. The main components that are nested in a serializer are called serialization anchors. The function of the serialization anchors is to identify the XML data and write it to the output. Serialization anchors are analogous to the anchors that are used in a parser, except that they work in the opposite direction.
1. In Data Transformation Studio, open an existing parser in an IntelliScript editor.
2. Right-click the parser and click Create Serializer. Data Transformation notifies you that the serializer has been created, and it displays the serializer in the IntelliScript. The name of the serializer is derived from that of the parser, with the suffix _serializer. For example, if you create a serializer from Parser1, the serializer is called Parser1_serializer. Data Transformation stores the serializer in a new TGP script file, which has a name such as Parser1_auto_generated_serializer.tgp. Use the Data Transformation Explorer or the Component view to open the new serializer file in an IntelliScript editor.
3. Test the serializer. For more information, see Running a Serializer on page 172. If necessary, edit the IntelliScript. For more information, see Troubleshooting an Auto-Generated Serializer on page 170.
Online Sample
For an example of an auto-generated serializer, open samples\Projects\Serialization\TabDelimited\TabDelimited.cmw. To run the sample:
1. Set MyHL7Parser as the startup component, and run it. This generates an output file Results\output.xml.
2. Now set MyHL7Parser_serializer as the startup component, and run it. At the prompt, browse to Results\output.xml as the input. The original input file is regenerated.
A variant of this project is in samples\Projects\Serialization\HL7\HL7.cmw. You can generate the serializer yourself and try the above experiment.
Anchor
Marker
Now, generate a serializer from this parser, and run the serializer on the following input:
<FullName>Larissa Chan</FullName>
Serialization Mode
The example source might contain text that you don't want in the serializer output. In that case, you can modify the behavior of the Create Serializer command in a way that does not generate the StringSerializer serialization anchors. To do this, set the serialization_mode property of the Parser component. The possible values of the serialization_mode are explained in the following table.
Value / Description
- Full. The Create Serializer command copies the non-XML text to the serializer configuration. This is the default behavior.
- Outline. The Create Serializer command copies only the delimiters of the non-XML text to the serializer configuration. Under the Outline option, you can select the use_markers option. This causes the Create Serializer command to copy the content of the Marker anchors but only the delimiters of other non-XML text.
Behavior The Create Serializer command converts: - Content anchors to ContentSerializer serialization anchors - The delimiters of other text in the example source to StringSerializer serialization anchors The Create Serializer command converts: - Content anchors to ContentSerializer serialization anchors - The complete text of Marker anchors to StringSerializer serialization anchors - The delimiters of other text in the example source to StringSerializer serialization anchors The Create Serializer command converts: - Content anchors to ContentSerializer serialization anchors - All other text in the example source to StringSerializer serialization anchors
not selected
Name<tab>Larissa Chan
selected
full
Root Tag
On the XML Generation tab of the project properties, there is an option to Add XML Root Element. The effect of this option is to nest the parser output in a specified root element. For more information, see XML Generation Properties on page 223. If this option is selected, and you try to run an auto-generated serializer on the parser output, it cannot find the input XML elements because of the nesting. The solution is to set the root_tag property of the serializer to the same value as in the project properties. The serializer then finds its input nested under the root.
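The nesting problem can be illustrated with a short Python sketch. It only models the lookup and is not how Data Transformation resolves its input. The function find_input and the element names Wrapper and Person are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_input(document, expected_tag, root_tag=None):
    """Locate the element that a serializer expects as its top-level input.

    Without root_tag, the expected element must be the document root.
    With root_tag, the expected element is looked up inside that wrapper.
    Returns None if the element is not found where it is expected.
    """
    root = ET.fromstring(document)
    if root_tag is None:
        return root if root.tag == expected_tag else None
    if root.tag != root_tag:
        return None
    return root.find(expected_tag)


wrapped = "<Wrapper><Person><First>Ron</First></Person></Wrapper>"

# Without a root tag, Person is not found because it is nested in Wrapper.
not_found = find_input(wrapped, "Person")

# With the root tag set to the wrapper name, the nested Person is found.
found = find_input(wrapped, "Person", root_tag="Wrapper")
```

In this model, setting root_tag to the same name as the wrapper element is what allows the lookup to succeed.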
Variables
If the parser uses a variable to store intermediate results, an auto-generated serializer may fail. To solve the problem, review the serializer logic, and remove the variable if necessary.
Additional Components
The Create Serializer command inverts the anchors of a parser. It does not invert components such as document processors, transformers, or actions. For example, suppose that a parser uses a PdfToTxt_4 document processor to convert PDF source documents to text. The parser contains anchors that transform the text to XML. The auto-generated serializer transforms the XML back to text. It does not convert the text to PDF. To obtain PDF output, edit the serializer and insert an XmlToDocument processor. In another example, suppose that a parser uses an AddString transformer to add a prefix to the output of a Content anchor. The auto-generated serializer does not remove the prefix. If you need to remove it, you can insert a component such as a Replace transformer.
Click File > New > Project. Under the Data Transformation category, select a Serializer Project and click Next. Follow the wizard prompts to enter the serializer options. When you finish, the Data Transformation Explorer view displays the new project containing the serializer. The Component view displays the serializer.
Click File > New > Serializer. Follow the wizard prompts to enter the serializer options. When you finish, the Data Transformation Explorer view displays a new TGP script file defining the serializer. The Component view displays the serializer.
Display the serializer in an IntelliScript editor. Under the contains line, add a sequence of serialization anchors and actions. Run and test the serializer, and modify the IntelliScript as required. For more information, see Running a Serializer on page 172.
1. At the top level of the IntelliScript, select the three dots (...) symbol. Press Enter and type a name for the serializer.
2. To the right of the name, press Enter. Select a Serializer component from the list.
3. Expand the tree under the Serializer component. Assign its properties as required.
4. If necessary, add an XSD schema defining the XML syntax of the serializer input. For more information, see Data Holders on page 55.
5. Under the contains line, add a sequence of nested serialization anchors and actions. For more information, see Serialization Anchors on page 172 and Actions on page 137.
6. Run and test the serializer and modify the IntelliScript as required. For more information, see Running a Serializer on page 172.
Online Sample
For an example of a serializer that we created by editing the IntelliScript, open the project samples\Projects\ManualSerializer\ManualSerializer.cmw. You can run the serializer on the input file Example XML of Person.xml.
Running a Serializer
To run a serializer in Data Transformation Studio:
1. Set the serializer as the startup component.
2. Click Run > Run.
3. In the I/O Ports table, double-click the input row and select the input XML file. If you created the serializer from a parser, a convenient test file is the parser output. Browse to the output file, by default Results\output.xml, in the project folder. Alternatively, you can set the example_source property of the serializer. This lets you test a serializer repeatedly on the same input, without needing to browse to the file each time.
4. When the execution is complete, Data Transformation Studio displays the Events view. Examine the events for any failures or warnings.
5. To view the serialization results, open the output file, located in the Results folder of the project.
Serialization Anchors
The main components that you can use in a serializer are called serialization anchors. These are analogous to the anchors that are used in a parser, except that they work in the opposite direction. Anchors read data from locations in the source document and write the data to XML. Serialization anchors read XML data and write the data to locations in the output document. Please note that a serialization anchor is not an anchor, despite their similar names. You cannot use anchors in a serializer, and you cannot use serialization anchors in a parser. The most important serialization anchors are ContentSerializer and StringSerializer:
- A ContentSerializer writes the content of a specified data holder to the output document. It is the inverse of a Content anchor, which reads content from a source document.
- A StringSerializer writes a predefined string to the output. It is the inverse of a Marker anchor, which finds a predefined string in a source document.
The first StringSerializer instructs the serializer to write the following text in the output document:
First Name:<tab>
The ContentSerializer writes the value of the Person/Name/First element to the output.
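The division of labor between the two anchor types can be modeled in a few lines of Python. This is a conceptual sketch, not the IntelliScript configuration or the Data Transformation engine. The function names are hypothetical, the sample XML is a shortened Person document, and the trailing newline anchor is an assumption added to end the line.

```python
import xml.etree.ElementTree as ET


def string_serializer(text):
    """Model of a StringSerializer: always writes the same predefined string."""
    return lambda root: text


def content_serializer(path, opening_str="", closing_str=""):
    """Model of a ContentSerializer: writes the content found at `path`."""
    def write(root):
        value = root.findtext(path)
        if value is None:
            raise ValueError(f"no data found for {path}")
        return opening_str + value + closing_str
    return write


def run_serializer(anchors, xml_text):
    """Run the anchors in sequence and concatenate what they write."""
    root = ET.fromstring(xml_text)
    return "".join(anchor(root) for anchor in anchors)


anchors = [
    string_serializer("First Name:\t"),   # predefined label and tab
    content_serializer("Name/First"),     # value of Person/Name/First
    string_serializer("\n"),              # assumed line terminator
]

person_xml = (
    '<Person gender="M"><Name><First>Ron</First><Last>Lehrer</Last></Name></Person>'
)
output = run_serializer(anchors, person_xml)
```

Because the anchors run in sequence, the order in which they are listed determines the layout of the output document.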
Note: The IntelliScript represents the newline and tab using ASCII codes and , respectively. For more information about entering special characters in the IntelliScript Editor, see Using Data Transformation Studio in Eclipse.
Now, assume that you run the serializer on the following XML:
<Person gender="M"> <Name> <First>Ron</First> <Last>Lehrer</Last> </Name> <Id>547329876</Id> <Age>27</Age> </Person>
The serializer contains additional serialization anchors, which are not shown in the above illustration. The complete output of the serializer is:
Property / Description
- name. A name that you assign to the component. Data Transformation includes the name in the event log. This can help you find an event that was caused by the particular component.
- remark. A comment describing the component.
- disabled. If selected, Data Transformation ignores the component. This is useful for testing and debugging, or for making minor modifications in a project without deleting the existing components.
- optional. By default, if a component fails, its parent component fails. If you select the optional property, the parent component does not fail. For more information, see Failure Handling on page 231.
- on_fail. If the component fails, writes an entry in the user log. For more information, see Failure Handling on page 231.
Within a Serializer component, you can nest the following serialization anchors:
Serialization Anchor / Description
- AlternativeSerializers. Specifies alternative serialization anchors that may be appropriate, depending on the structure of the XML.
- ContentSerializer. Serializes XML data and writes it to the output document.
- DelimitedSectionsSerializer. Serializes sections of data, writing a separator string between them.
- EmbeddedSerializer. Runs a secondary serializer.
- GroupSerializer. Binds a set of serialization anchors together for processing as a unit.
- RepeatingGroupSerializer. Creates a repetitive structure in the output document.
- StringSerializer. Writes a specified string to the output document.
Serializer
A Serializer converts XML documents to output documents in any format.
Table 11-1. Advanced Properties
Property / Description
- validate_source_document. The level of source XML validation that the serializer performs. The options are:
  - Partial. Permits some deviations from the schema.
  - Strict. Enforces the schema strictly.
  For more information, see Role of XSD in Serialization and Mapping on page 63.
- example_source. A sample XML source document. When you run the serializer in Data Transformation Studio, it operates on the sample document. The value of the property is an input port. For more information, see Ports on page 15. If you leave the example_source property blank, Data Transformation prompts you for a source document when you run the serializer. Nested within the example_source, you can assign a preprocessor that converts the source documents to a format that the serializer can accept. For example, the example_source might be a Microsoft Excel workbook configured with the ExcelToXml preprocessor. For more information, see Document Processors on page 23.
- output_file_extension. The file extension of the generated output file, including the leading period, for example: .txt
- root_tag. The name of a root XML element that is not in the XSD schema of the project. For example, if the top-level element of the schema is Person, but the XML input nests Person in an element called InputWrapper, enter root_tag = InputWrapper.
A list of transformers that the Serializer applies to all serialized data.
These properties are useful in situations where the serializer must select specific occurrences of data holders. For more information, see Locators, Keys, and Indexing on page 191.
- name, remark, on_fail. For more information about these properties, see Standard Serializer Properties on page 174.
AlternativeSerializers
This serialization anchor lets you define a set of alternative, nested serialization anchors. You can define a criterion for the alternative that the serializer should accept. Only the accepted alternative affects the serializer output. The other serialization anchors, whether failed or successful, have no effect on the serializer output.
Example
The input XML might contain a Product element or a Service element, but not both. You want to serialize whichever element is in the input.
Define an AlternativeSerializers serialization anchor, and set its selector property to ScriptOrder. Within the AlternativeSerializers, nest two ContentSerializer serialization anchors. Configure one of them to process the Product element and the other to process Service.
Table 11-2. Basic Properties
Property / Description
- selector. The criterion for deciding which alternative to accept. The options are:
  - ScriptOrder. Data Transformation tests the nested serialization anchors in the sequence that they are defined in the IntelliScript. It accepts the first one that succeeds. If all the nested serialization anchors fail, the AlternativeSerializers component fails.
  - NameSwitch. Data Transformation searches for the nested serialization anchor whose name property is specified in a data holder. It ignores the other nested serialization anchors. If the named serialization anchor fails, the AlternativeSerializers component fails.
Description For more information about these properties, see Standard Serializer Properties on page 174.
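The difference between the two selector options can be modeled in Python. This sketch is an illustration of the selection rules described above, not the product implementation. The names AnchorFailed, content, and alternative_serializers are hypothetical, as is the kind attribute used to hold the name for NameSwitch.

```python
import xml.etree.ElementTree as ET


class AnchorFailed(Exception):
    """Raised when a modeled serialization anchor cannot produce output."""


def content(tag):
    """Model of a ContentSerializer that fails if its element is missing."""
    def write(root):
        value = root.findtext(tag)
        if value is None:
            raise AnchorFailed(tag)
        return value
    return write


def alternative_serializers(alternatives, selector="ScriptOrder", switch=None):
    """alternatives is a list of (name, anchor) pairs in script order.

    For NameSwitch, `switch` is a function that reads the wanted name
    from the input, standing in for the data holder that stores the name.
    """
    def write(root):
        if selector == "ScriptOrder":
            # Accept the first alternative that succeeds.
            for _name, anchor in alternatives:
                try:
                    return anchor(root)
                except AnchorFailed:
                    continue
            raise AnchorFailed("all alternatives failed")
        if selector == "NameSwitch":
            # Run only the named alternative; its failure is not recovered.
            wanted = switch(root)
            for name, anchor in alternatives:
                if name == wanted:
                    return anchor(root)
            raise AnchorFailed(f"no alternative named {wanted!r}")
        raise ValueError(f"unknown selector: {selector}")
    return write


alternatives = [("product", content("Product")), ("service", content("Service"))]
order = ET.fromstring('<Order kind="service"><Service>Delivery</Service></Order>')

by_order = alternative_serializers(alternatives)(order)
by_name = alternative_serializers(
    alternatives, "NameSwitch", switch=lambda root: root.get("kind")
)(order)
```

With ScriptOrder, the Product alternative is tried first and fails, so Service is accepted. With NameSwitch, only the named alternative is run.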
ContentSerializer
This serialization anchor writes the serialized data to the output document.
Table 11-4. Basic Properties
Property / Description
- opening_str. A string that the anchor should write before the data_holder.
- closing_str. A string that the anchor should write after the data_holder.
- data_holder. The data holder containing the data.
Description If selected, the data_holder can be empty. If not selected, and the data_holder is empty, the ContentSerializer fails. If selected, the default transformers of the Serializer are not applied to the serialized data. A list of transformers that are applied to the serialized data. For more information about these properties, see Standard Serializer Properties on page 174.
ignore_default_transformers
transformers name
DelimitedSectionsSerializer
This serialization anchor processes sections of data. Between each section of the output, the DelimitedSectionsSerializer writes a separator string. Within the DelimitedSectionsSerializer, nest other serialization anchors. Each nested serialization anchor is responsible for outputting a single section.
Example
The XML input contains an employee resume. You wish to write the data to an output text document in the following format:
---------------------------
Jane Palmer
Employee ID 123456
---------------------------
Professional Experience
...
---------------------------
Education
...
Define a DelimitedSectionsSerializer with the line of hyphens as its separator. Because you want a line of hyphens before each section, set separator_position = before. Within the DelimitedSectionsSerializer, nest three GroupSerializer components. The first GroupSerializer writes the Jane Palmer section, the second writes the Professional Experience section, and so forth.
Optional Sections
In the above example, suppose that the second section, Professional Experience, is missing from some input XML documents. You nonetheless want to write its separator to the output, like this:
---------------------------
Jane Palmer
Employee ID 123456
---------------------------
---------------------------
Education
...
In the second GroupSerializer, select the optional property. This means that if the GroupSerializer fails, it should not cause the DelimitedSectionsSerializer to fail. In the DelimitedSectionsSerializer, set using_placeholders = always. This means to write the separator of an optional section, even if the section itself is missing.
Alternatively, suppose that if the Professional Experience section is missing, you do not want to write its separator:
---------------------------
Jane Palmer
Employee ID 123456
---------------------------
Education
...
In the DelimitedSectionsSerializer, set using_placeholders = never. This means not to write the separator of a missing section.
Property / Description
- separator. The separator string.
- separator_position. Position of the separator relative to the sections. The options are before, after, between, and around.
- using_placeholders. This property specifies whether the DelimitedSectionsSerializer should write the separator of an optional section that is missing from the XML input. The options are always, never, and when necessary.
The following table illustrates the possible values of the separator_position property. The examples assume that the separator is a vertical-line character ( |).
separator_position  Example     Explanation
before              |1|2|3|4    Write a separator before each section, including the first section.
after               1|2|3|4|    Write a separator after each section, including the last section.
between             1|2|3|4     Write a separator between the successive sections, but not before the first section and not after the last section.
around              |1|2|3|4|   Write separators before and after each section, including the first and last sections.
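The four separator_position values can be sketched in a few lines of Python. This is only an illustration of the table above, not the product's implementation; the function name join_sections is hypothetical.

```python
def join_sections(sections, separator, position):
    """Concatenate section texts, placing the separator as separator_position describes."""
    if position == "before":
        return "".join(separator + s for s in sections)
    if position == "after":
        return "".join(s + separator for s in sections)
    if position == "between":
        return separator.join(sections)
    if position == "around":
        # One separator before the first section, then one after every section.
        return separator + "".join(s + separator for s in sections)
    raise ValueError(f"unknown separator_position: {position!r}")


sections = ["1", "2", "3", "4"]
for position in ("before", "after", "between", "around"):
    print(position, join_sections(sections, "|", position))
```

Running the loop prints the same four example strings that appear in the table.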
The following table illustrates the possible values of the using_placeholders property. The examples assume that the separator_position is before and that sections 2 and 4 are missing.
using_placeholders  Example   Explanation
always              |1||3|    Always write the separator of a missing section.
never               |1|3      Never write the separator of a missing section.
when necessary      |1||3     Always write the separator of a missing internal section. Never write the separator of a missing terminal section.
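As a rough Python sketch of the using_placeholders behavior, assume that separator_position is before and that a missing section is represented by None. The function name write_sections is hypothetical, and the sketch assumes that a "terminal" section means a trailing one.

```python
def write_sections(sections, separator="|", using_placeholders="always"):
    """sections is a list of strings; None marks a missing optional section.

    Assumes separator_position = before, as in the table above.
    """
    if using_placeholders == "always":
        kept = list(sections)
    elif using_placeholders == "never":
        kept = [s for s in sections if s is not None]
    elif using_placeholders == "when necessary":
        kept = list(sections)
        # Assumption: "terminal" means trailing, so drop missing sections at the end only.
        while kept and kept[-1] is None:
            kept.pop()
    else:
        raise ValueError(f"unknown using_placeholders: {using_placeholders!r}")
    # A missing section that is kept contributes only its separator.
    return "".join(separator + (s or "") for s in kept)


sections = ["1", None, "3", None]  # sections 2 and 4 are missing
for mode in ("always", "never", "when necessary"):
    print(mode, write_sections(sections, "|", mode))
```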
Description For more information about these properties, see Standard Serializer Properties on page 174.
EmbeddedSerializer
This serialization anchor activates a secondary Serializer, which writes its output in the same output document.
Example
The XML input is a family tree. The input contains Person elements, which are recursively nested as shown:
<Person>              <!-- Parent -->
  ...
  <Person>            <!-- Child -->
    ...
    <Person>          <!-- Grandchild -->
      ...
    </Person>
  </Person>
</Person>
A Serializer can use an EmbeddedSerializer component to call itself recursively, until all levels of nesting are exhausted.
Table 11-8. Basic Properties
Property            Description
serializer          The name of the secondary serializer. The serializer must be defined at the global level of the IntelliScript.
schema_connections  Connects the data holders that are referenced in the secondary serializer to the data holders that are referenced in the main serializer. The property contains a list of Connect subcomponents that define the correspondence. For more information, see Connect on page 103. If all the data holders in the main and secondary serializers are identical, you can omit this property. If there are any differences between the data holders, you must connect the data holders explicitly, even the ones that are identical. In the recursive example described above, Person should be connected to Person/Person. This instructs the secondary instance of the serializer to process a nested level of the input. It is sufficient to connect just the parent element (Person), and not the nested elements (Person/*s/Name, Person/*s/BirthDate, etc.), provided that the two Person elements have the same XSD type.
Description For more information about these properties, see Standard Serializer Properties on page 174.
GroupSerializer
The GroupSerializer serialization anchor binds its nested serialization anchors together. You can set properties of the GroupSerializer that affect the members of the group.
Table 11-10. Basic Properties
Property        Description
source, target  These properties are useful in situations where the serialization anchor must select specific occurrences of data holders. For more information, see Locators, Keys, and Indexing on page 191.
Description For more information about these properties, see Standard Serializer Properties on page 174.
RepeatingGroupSerializer
This serialization anchor writes a repetitive structure to the output document. A RepeatingGroupSerializer is useful if the XML data contains a multiple-occurrence data holder. It iterates over the occurrences of the data holder and outputs the data. For more information, see Multiple-Occurrence Data Holders on page 67. Within the RepeatingGroupSerializer, nest serialization anchors that process and output each occurrence of the data holder. Optionally, you can define a separator that the RepeatingGroupSerializer writes to the output between the iterations.
Example
The XML input contains the following structure:
<Persons>
  <Person>
    <Name>John</Name>
    <Age>35</Age>
  </Person>
  <Person>
    <Name>Larissa</Name>
    <Age>42</Age>
  </Person>
  ...
</Persons>
A RepeatingGroupSerializer, using a newline character as a separator, can output this data to:
John 35
Larissa 42
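The iteration that a RepeatingGroupSerializer performs can be approximated in Python with the standard xml.etree.ElementTree module. This is a sketch only: it assumes that each iteration writes the name and the age separated by a space, and the function name serialize_persons is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_INPUT = """<Persons>
  <Person><Name>John</Name><Age>35</Age></Person>
  <Person><Name>Larissa</Name><Age>42</Age></Person>
</Persons>"""


def serialize_persons(xml_text, separator="\n"):
    """Write one text fragment per Person occurrence, joined by the separator."""
    root = ET.fromstring(xml_text)
    fragments = []
    for person in root.findall("Person"):  # one iteration per occurrence
        name = person.findtext("Name")
        age = person.findtext("Age")
        fragments.append(f"{name} {age}")
    return separator.join(fragments)  # separator between the iterations


print(serialize_persons(XML_INPUT))
```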
You can iterate over several multiple-occurrence data holders in parallel. For example, you can iterate over a list of men and a list of women, and output a list of married couples. To do this, insert a ContentSerializer within the repeating group for each data holder.
Table 11-12. Basic Properties
Property            Description
separator           A serialization anchor, typically a StringSerializer, that outputs the separator. Leave this property empty if you do not want to output a separator.
separator_position  Position of the separator relative to the iterations. The options are before, after, between, and around.
The following table illustrates the possible values of the separator_position property. The examples assume that the separator is a vertical-line character ( |).
separator_position  Example   Explanation
before              |1|2|3    Write a separator before each iteration, including the first iteration.
after               1|2|3|    Write a separator after each iteration, including the last one.
between             1|2|3     Write a separator between the successive iterations, not before the first iteration and not after the last iteration.
around              |1|2|3|   Write separators before and after each iteration, including the first and last iterations.
Property           Description
count              The number of iterations to run. Enter a number, or click the browse button and select a data holder that contains the number. If blank, the iterations continue until the input is exhausted.
current_iteration  A data holder, where the RepeatingGroupSerializer should output the number of the current iteration. You can use a ContentSerializer to write the number to the output.
source, target     These properties are useful in situations where the serialization anchor must select specific occurrences of data holders. For more information, see Locators, Keys, and Indexing on page 191.
on_iteration_fail  If an iteration fails, writes an entry in the user log. Use the on_fail property to write an entry if the entire RepeatingGroupSerializer fails. Use on_iteration_fail to write an entry if a single iteration fails. For more information, see Failure Handling on page 231.
name               For more information about these properties, see Standard Serializer Properties on page 174.
StringSerializer
This serialization anchor writes a predefined string to the output document.
Table 11-14. Basic Properties
Property
str
Description For more information about these properties, see Standard Serializer Properties on page 174.
CHAPTER 12
Mappers
This chapter includes the following topics:
Creating a Mapper, 183
Components Nested within a Mapper, 184
Mapper Example, 184
Running a Mapper, 185
Standard Mapper Properties, 186
Mapper Quick Reference, 186
Mapper Component Reference, 186
Mapper Anchor Component Reference, 187
Creating a Mapper
Mappers are components that convert an XML source document to another XML structure or schema. A mapper processes the XML input like a serializer, and it generates the XML output like a parser. Because both the input and the output are fully structured XML, the configuration is straightforward. The principles of mapper operation are similar to those of a serializer. For more information, see Serializers on page 167. Within a mapper, you can nest mapper anchors and actions. Mapper anchors are analogous to anchors, which are used in parsers, and to serialization anchors, which are used in serializers. This chapter explains how to configure the mapper and mapper anchor components.
To create a mapper:
1. Add XSD input and output schemas to the project. You can use either the same schema or different schemas for the input and the output. For more information, see Data Holders on page 55.
2. At the top level of the IntelliScript, add a Mapper component.
3. Assign the source and target properties of the Mapper to the input and output elements of the Mapper, respectively. For more information, see Locators, Keys, and Indexing on page 191.
4. Optionally, assign the example_source property of the Mapper to a sample XML source document. As you add components to the mapper, the Studio color-codes the corresponding locations in the example source. The colors can help you confirm that the components are defined correctly.
5. Edit the other properties of the Mapper as required.
6. Within the Mapper, nest a sequence of Map actions, mapper anchors, and any other required components. For more information, see Components Nested within a Mapper on page 184.
7. Test the mapper and modify the IntelliScript if required. For more information, see Running a Mapper on page 185.
Components Nested within a Mapper
Within a Mapper, you can nest the following components:
- Any number of Map actions. The actions retrieve a data holder from the input and write the content to the output.
- Optionally, any number of mapper anchors. For more information, see the Mapper Anchor Component Reference on page 187.
- Optionally, any number of additional actions.
The Map actions and the mapper anchors can be in any sequence. You can also insert other actions in the sequence. Notice that a mapper uses Map actions rather than mapper anchors to write to the output XML. This may seem a little different from parsers and serializers, where the output is created by anchors and serialization anchors, respectively. Actually, this is just a terminology issue. The Map action could have been defined as a mapper anchor. It is defined as an action because it is useful in other circumstances, unrelated to mappers.
Mapper Example
To illustrate the mapper configuration, we present a simple example.
Source XML
The input of the mapper is an XML document containing a list of personal names and their associated ID numbers.
<Persons>
  <Person ID="10">Bob</Person>
  <Person ID="17">Larissa</Person>
  <Person ID="13">Marie</Person>
</Persons>
Output XML
The desired output of the mapper is an XML list of the names and ID numbers, with no association between them.
<SummaryData>
  <Names>
    <Name>Bob</Name>
    <Name>Larissa</Name>
    <Name>Marie</Name>
  </Names>
  <IDs>
    <ID>10</ID>
    <ID>17</ID>
    <ID>13</ID>
  </IDs>
</SummaryData>
Mapper Configuration
The following mapper configuration performs the desired transformation:
The RepeatingGroupMapping iterates over the Person elements of the input. It uses Map actions to write the data to the Name and ID elements of the output.
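For readers who want to check the expected result outside the Studio, the same transformation can be approximated in Python with the standard xml.etree.ElementTree module. This sketch is not an IntelliScript equivalent; the function name map_persons is hypothetical, and the loop only plays the role of the RepeatingGroupMapping with its two Map actions.

```python
import xml.etree.ElementTree as ET

SOURCE = """<Persons>
  <Person ID="10">Bob</Person>
  <Person ID="17">Larissa</Person>
  <Person ID="13">Marie</Person>
</Persons>"""


def map_persons(source_xml):
    """Build the SummaryData structure from the Persons source document."""
    source = ET.fromstring(source_xml)
    summary = ET.Element("SummaryData")
    names = ET.SubElement(summary, "Names")
    ids = ET.SubElement(summary, "IDs")
    for person in source.findall("Person"):  # one iteration per Person
        ET.SubElement(names, "Name").text = person.text   # first "Map": text to Name
        ET.SubElement(ids, "ID").text = person.get("ID")  # second "Map": @ID to ID
    return summary


result = map_persons(SOURCE)
print(ET.tostring(result, encoding="unicode"))
```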
Running a Mapper
To run a mapper in Data Transformation Studio:
1. Set the mapper as the startup component.
2. Click Run > Run.
3. In the I/O Ports table, double-click the input row and select the input XML file. Alternatively, you can set the example_source property of the mapper. This lets you test a mapper repeatedly on the same input, without needing to browse to the file each time.
4. When the execution is complete, Data Transformation Studio displays the Events view. Examine the events for any failures or warnings.
5. View the mapping results by opening the output.xml file, located in the Results folder of the project.
Standard Mapper Properties
Property  Description
name      A name that you assign to the component. Data Transformation includes the name in the event log. This can help you find an event that was caused by the particular component.
remark    A comment describing the component.
disabled  If selected, Data Transformation ignores the component. This is useful for testing and debugging, or for making minor modifications in a project without deleting the existing components.
optional  By default, if a component fails, its parent component fails. If you select the optional property, the parent component does not fail. For more information, see Failure Handling on page 231.
on_fail   If the component fails, writes an entry in the user log. For more information, see Failure Handling on page 231.
Within a Mapper component, you can nest the following mapper anchors:
Mapper Anchor          Description
AlternativeMappings    Defines a set of nested mappings, one of which is valid for the current XML context.
EmbeddedMapper         Activates a secondary mapper.
GroupMapping           Binds its nested mapper anchors and actions together.
RepeatingGroupMapping  Maps repetitive XML structures.
Mapper
A Mapper performs XML to XML transformations. It converts a source XML document to an output document that has a different XML structure. You must use the source and target properties to identify the root elements of the XML documents. For example, if the document element of the source is Persons, and the document element of the output is SummaryData, set the source and target as follows:
Property  Description
source    Under this property, insert a Locator component, and select the root of the source XML from a Schema view. For more information about this property, see Locators, Keys, and Indexing on page 191.
target    Under this property, insert a Locator component, and select the root of the output XML from a Schema view. For more information about this property, see Locators, Keys, and Indexing on page 191.
Description The level of source XML validation that the mapper performs. The options are: - Partial. Permits some deviations from the schema. - Strict. Enforces the schema strictly. For more information, see Role of XSD in Serialization and Mapping on page 63. A sample XML source document. As you define the mapper components, the Studio color-codes them in the example source. When you run the mapper in the Studio, it operates on the sample document. The value of the property is an input port. For more information, see Ports on page 15. If you leave the example_source property blank, Data Transformation prompts you for a source document when you run the mapper. Nested within the example_source, you can assign a preprocessor that converts the source documents to a format that the mapper can accept. For example, the example_source might be a Microsoft Excel workbook configured with the ExcelToXml preprocessor. For more information, see Document Processors on page 23. The name of a root XML element that is not in the XSD schema of the input. For example, if the top-level element of the schema is Person, but the XML input nests Person in an element called InputWrapper, enter root_tag = InputWrapper. For more information about these properties, see Standard Mapper Properties on page 186.
example_source
root_tag
AlternativeMappings
This mapper anchor lets you define a set of alternative, nested mapper anchors. You can define a criterion for the alternative that the mapper should accept. Only the accepted alternative affects the mapper output. The other mapper anchors, whether failed or successful, have no effect on the mapper output.
Example
The input XML may contain a Product element or a Service element, but not both. You wish to process whichever element is in the input. Define an AlternativeMappings mapper anchor, and set its selector property to ScriptOrder. Within the AlternativeMappings, nest two Map actions. Configure one of them to process the Product element and the other to process Service.
Table 12-3. Basic Properties
Property  Description
selector  The criterion for deciding which alternative to accept. The options are:
          - ScriptOrder. Data Transformation tests the nested mapper anchors in the sequence that they are defined in the IntelliScript. It accepts the first one that succeeds. If all the nested mapper anchors fail, the AlternativeMappings component fails.
          - NameSwitch. Data Transformation searches for the nested mapper anchor whose name property is specified in a data holder. It ignores the other nested mapper anchors. If the named mapper anchor fails, the AlternativeMappings component fails.
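The two selector options can be pictured with a small Python sketch. Everything here is illustrative: the MappingFailed exception, the map_product and map_service functions, and the dictionary that stands in for the XML input are hypothetical and are not part of Data Transformation.

```python
class MappingFailed(Exception):
    """Raised by a hypothetical mapping that cannot process the input."""


def map_product(doc):
    if "Product" not in doc:
        raise MappingFailed("no Product element")
    return {"kind": "product", "value": doc["Product"]}


def map_service(doc):
    if "Service" not in doc:
        raise MappingFailed("no Service element")
    return {"kind": "service", "value": doc["Service"]}


# The nested alternatives, in the order in which they are defined.
ALTERNATIVES = [("product", map_product), ("service", map_service)]


def select_script_order(doc, alternatives=ALTERNATIVES):
    """ScriptOrder: try each alternative in sequence and accept the first success."""
    for _name, mapping in alternatives:
        try:
            return mapping(doc)
        except MappingFailed:
            continue
    raise MappingFailed("all alternatives failed")


def select_name_switch(doc, name, alternatives=ALTERNATIVES):
    """NameSwitch: run only the alternative with the given name."""
    for alt_name, mapping in alternatives:
        if alt_name == name:
            return mapping(doc)  # a failure here fails the whole selection
    raise MappingFailed(f"no alternative named {name!r}")


print(select_script_order({"Service": "Repair"}))
print(select_name_switch({"Product": "Lamp"}, "product"))
```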
Description For more information about these properties, see Standard Serializer Properties on page 174.
EmbeddedMapper
This mapper anchor activates a secondary Mapper, which stores its output in the same output document.
Example
The XML input is a family tree. The input contains Person elements, which are recursively nested as shown:
<Person>              <!-- Parent -->
  ...
  <Person>            <!-- Child -->
    ...
    <Person>          <!-- Grandchild -->
      ...
    </Person>
  </Person>
</Person>
A Mapper can use an EmbeddedMapper component to call itself recursively, until all levels of nesting are exhausted.
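The recursion can be illustrated in Python, where a function calls itself for each nested Person until no nested levels remain. This is a sketch only: the name attribute and the collect_names function are hypothetical, because the example above elides the content of each Person element.

```python
import xml.etree.ElementTree as ET

# Hypothetical family tree; the "name" attribute is an assumption for illustration.
FAMILY = """<Person name="Ann">
  <Person name="Ben">
    <Person name="Cal"/>
  </Person>
</Person>"""


def collect_names(person, depth=0):
    """Process one Person, then recurse into each directly nested Person."""
    rows = [(depth, person.get("name"))]
    for child in person.findall("Person"):  # direct children only
        rows.extend(collect_names(child, depth + 1))
    return rows


rows = collect_names(ET.fromstring(FAMILY))
for depth, name in rows:
    print("  " * depth + name)
```

The recursion stops by itself when a Person has no nested Person elements, which corresponds to all levels of nesting being exhausted.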
Table 12-5. Basic Properties
Property            Description
mapper              The name of the secondary mapper.
schema_connections  Connects the data holders that are referenced in the secondary mapper to the data holders that are referenced in the main mapper. The property contains a list of Connect subcomponents that define the correspondence. For more information, see Connect on page 103. If all the data holders in the main and secondary mappers are identical, you can omit this property. If there are any differences between the data holders, you must connect the data holders explicitly, even the ones that are identical. In the recursive example described above, Person should be connected to Person/Person. This instructs the secondary instance of the mapper to process a nested level of the input. It is sufficient to connect just the parent element (Person), and not the nested elements (Person/*s/Name, Person/*s/BirthDate, etc.), provided that the two Person elements have the same XSD type.
Description For more information about these properties, see Standard Serializer Properties on page 174.
GroupMapping
The GroupMapping mapper anchor binds its nested mapper anchors and actions together. You can set properties of the GroupMapping that affect the members of the group.
Table 12-7. Basic Properties
Property        Description
source, target  These properties are useful in situations where the mapper anchor must select specific occurrences of data holders. For more information, see Locators, Keys, and Indexing on page 191.
Description For more information about these properties, see Standard Mapper Properties on page 186.
RepeatingGroupMapping
This mapper anchor processes a repetitive structure in the input or output.
A RepeatingGroupMapping is useful if the XML input and/or output contains a multiple-occurrence data holder. It iterates over occurrences of the data holders. For more information, see Multiple-Occurrence Data Holders on page 67. Within the RepeatingGroupMapping, nest mapper anchors and actions that process each occurrence of the data holder.
Example
For more information, including an example of a RepeatingGroupMapping, see the Mapper Example on page 184.
Table 12-9. Advanced Properties
Property           Description
count              The number of iterations to run. Enter a number, or click the browse button and select a data holder that contains the number. If blank, the iterations continue until the input is exhausted.
current_iteration  A data holder, where the RepeatingGroupMapping should output the number of the current iteration.
source, target     These properties are useful in situations where the mapper anchor must select specific occurrences of data holders. For more information, see Locators, Keys, and Indexing on page 191.
name               For more information about these properties, see Standard Serializer Properties on page 174.
on_iteration_fail  If an iteration fails, writes an entry in the user log. Use the on_fail property to write an entry if the entire RepeatingGroupMapping fails. Use on_iteration_fail to write an entry if a single iteration fails. For more information, see Failure Handling on page 231.
name               For more information about these properties, see Standard Mapper Properties on page 186.
CHAPTER 13
Locators, Keys, and Indexing
This chapter includes the following topics:
Overview, 191
Example of Locators, 192
Example of Indexing by Key, 193
Source and Target Properties, 196
Standard Locator and Key Properties, 201
Locator and Key Quick Reference, 201
Locator and Key Component Reference, 202
Overview
In designing a transformation, a frequent issue is how to locate the data holders that you want to process. If the same data holders can occur multiple times in an XML structure, there can be ambiguities in identifying the occurrences. This chapter explains how to use the Locator and Key components to resolve the ambiguities. The components described in this chapter let you identify the occurrences of multiple-occurrence data holders in three ways:
- Sequentially. Each iteration of a component processes the next occurrence of the data holder.
- By occurrence number. For example, a component can select the third occurrence of a data holder.
- By a key such as an attribute or a nested element. The key uniquely identifies the occurrence of the data holder.
The sequential approach is the default. It is subject to some complexities that you can control by using the Locator component. The occurrence number and key approaches are collectively known as indexing. The term is analogous to the index of a book, where you use a page number or a subject key to identify the location of information. You can implement the indexing by using components called LocatorByOccurrence, LocatorByKey, and Key. You can use the locator and key components in parsers, serializers, or mappers. You can use the components to identify the occurrences of data holders in the input, the output, or both. The locator components are nested in the source and target properties of various transformation components. The meaning and usage of the source and target properties is explained below.
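As a loose analogy, the three ways of identifying occurrences correspond to three familiar ways of reading a Python list. The data and the helper names by_occurrence and by_key are hypothetical, and the sketch assumes that occurrence numbers start at 1.

```python
# Hypothetical occurrences of a multiple-occurrence data holder.
persons = [
    {"ID": "10", "name": "Bob"},
    {"ID": "17", "name": "Larissa"},
    {"ID": "13", "name": "Marie"},
]

# 1. Sequentially: each iteration processes the next occurrence.
sequential = [person["name"] for person in persons]


# 2. By occurrence number (assumed to be 1-based in this sketch).
def by_occurrence(occurrences, number):
    return occurrences[number - 1]


# 3. By key: a field that uniquely identifies each occurrence.
def by_key(occurrences, key_field, key_value):
    index = {item[key_field]: item for item in occurrences}
    return index[key_value]


print(sequential)
print(by_occurrence(persons, 3)["name"])
print(by_key(persons, "ID", "17")["name"])
```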
Example of Locators
To understand the issues involved in identifying data holders, consider the following example. The example illustrates the use of the target property and the Locator component.
This section explains the broad outline of the example. The following sections of the chapter explain in detail how the target and the Locator work.
The source document that the parser processes is a list containing a single employee per company:
John
Marie
Incorrect Solution
Suppose you use the following RepeatingGroup to parse the source document:
The problem is that both Company and Employee are multiple-occurrence elements. The RepeatingGroup creates multiple Employee elements correctly, but it does not know that each Employee element should be nested in a separate Company element.
Correct Solution
To resolve the ambiguity, you can assign the target property of the RepeatingGroup.
The target identifies the data holder that the RepeatingGroup should create. The target contains a Locator component pointing to the Company element. This means that each iteration of the RepeatingGroup creates a new occurrence of the Company element. If you configure the RepeatingGroup in this way, the output is correct: each Employee element is nested in a separate Company element.
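The difference between the two configurations can be sketched in Python. The sketch assumes a root element called Companies, which the text above does not name, and it assumes that the incorrect configuration places every Employee in a single Company. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_without_target(names):
    """Incorrect result: every Employee is nested in the same Company."""
    root = ET.Element("Companies")  # hypothetical root element
    company = ET.SubElement(root, "Company")
    for name in names:
        ET.SubElement(company, "Employee").text = name
    return root


def parse_with_target(names):
    """Correct result: each iteration creates a new Company occurrence."""
    root = ET.Element("Companies")  # hypothetical root element
    for name in names:
        company = ET.SubElement(root, "Company")
        ET.SubElement(company, "Employee").text = name
    return root


names = ["John", "Marie"]
print(ET.tostring(parse_without_target(names), encoding="unicode"))
print(ET.tostring(parse_with_target(names), encoding="unicode"))
```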
Example of Indexing by Key
The following example illustrates the use of:
- The source and target properties
- The Locator, Key, and LocatorByKey components
In the following sections of the chapter, we will explain the detailed operation of these properties and components.
Input
The input XML is a report listing the names of parents and their children.
For each parent, the XML lists a first name, a last name, and an ID. For each child, the XML lists a first name, a hobby, and the ID of the parent.
<Report>
  <Parents>
    <Parent id="1" firstName="John" lastName="Smith"/>
    <Parent id="2" firstName="Jane" lastName="Doe"/>
  </Parents>
  <Children>
    <Child name="Eric" hobby="Swimming" parentID="1"/>
    <Child name="Elizabeth" hobby="Biking" parentID="2"/>
    <Child name="Mary" hobby="Painting" parentID="1"/>
    <Child name="Edward" hobby="Swimming" parentID="2"/>
  </Children>
</Report>
Output
The desired output is a list of hobbies and the children who engage in each hobby.
<Hobbies>
  <Hobby name="Swimming">
    <Person firstName="Eric" lastName="Smith"/>
    <Person firstName="Edward" lastName="Doe"/>
  </Hobby>
  <Hobby name="Biking">
    <Person firstName="Elizabeth" lastName="Doe"/>
  </Hobby>
  <Hobby name="Painting">
    <Person firstName="Mary" lastName="Smith"/>
  </Hobby>
</Hobbies>
2. The transformation creates Hobby and Person elements. It identifies the Hobby element where it should nest each Person element by the name attribute of the Hobby.
3. The transformation writes the child's first name into the Person element.
4. The transformation writes the parent's last name into the Person element.
Mapper Configuration
The IntelliScript uses Key components to define identifiers for the data holders:
The first Key specifies that the id attribute is a unique identifier of a Parent element.
The second Key specifies that the name attribute is a unique identifier of a Hobby element.
The components of the Mapper configuration are described below.
1. The source property of the RepeatingGroupMapping specifies that each iteration should obtain its input from two data holders:
   - From an occurrence of the Child element
   - From the corresponding occurrence of the Parent element
2. The target property of the RepeatingGroupMapping specifies that each iteration should store its output in two data holders:
   - In an occurrence of the Person element
   - In the corresponding occurrence of the Hobby element
3. The first Map action copies the name attribute of the Child to the firstName attribute of the Person.
4. The second Map action copies the lastName attribute of the Parent into the lastName attribute of the Person.
Use of Indexing
The example uses indexing by key to identify the occurrences of the Parent and Hobby data holders.
In the source property of the RepeatingGroupMapping, the indexing identifies the occurrence of Parent that corresponds to a Child. In the target property, the indexing identifies the occurrence of Hobby where a Person should be nested.
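The effect of indexing by key can be reproduced in Python with two dictionaries: one that looks up an existing Parent by its id, and one that finds or creates a Hobby by its name. This is a sketch of the logic only, written with the standard xml.etree.ElementTree module; the function name build_hobbies is hypothetical.

```python
import xml.etree.ElementTree as ET

REPORT = """<Report>
  <Parents>
    <Parent id="1" firstName="John" lastName="Smith"/>
    <Parent id="2" firstName="Jane" lastName="Doe"/>
  </Parents>
  <Children>
    <Child name="Eric" hobby="Swimming" parentID="1"/>
    <Child name="Elizabeth" hobby="Biking" parentID="2"/>
    <Child name="Mary" hobby="Painting" parentID="1"/>
    <Child name="Edward" hobby="Swimming" parentID="2"/>
  </Children>
</Report>"""


def build_hobbies(report_xml):
    report = ET.fromstring(report_xml)
    # Source side: the id attribute is the key of each existing Parent.
    parents_by_id = {p.get("id"): p for p in report.find("Parents")}
    hobbies = ET.Element("Hobbies")
    # Target side: the name attribute is the key of each Hobby.
    hobbies_by_name = {}
    for child in report.find("Children"):  # sequential iteration over Child
        parent = parents_by_id[child.get("parentID")]
        hobby_name = child.get("hobby")
        hobby = hobbies_by_name.get(hobby_name)
        if hobby is None:  # the Hobby does not exist yet, so create it
            hobby = ET.SubElement(hobbies, "Hobby", name=hobby_name)
            hobbies_by_name[hobby_name] = hobby
        # Each iteration creates a new Person in the identified Hobby.
        ET.SubElement(hobby, "Person",
                      firstName=child.get("name"),
                      lastName=parent.get("lastName"))
    return hobbies


hobbies = build_hobbies(REPORT)
print(ET.tostring(hobbies, encoding="unicode"))
```

The lookup in parents_by_id plays the role of the key-based source, and the find-or-create step on hobbies_by_name plays the role of the key-based target.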
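Source and Target Properties
The following components have source and target properties:
- In parsers: Parser, Group, RepeatingGroup, EnclosedGroup, FindReplaceAnchor
- In serializers: Serializer, GroupSerializer, RepeatingGroupSerializer
- In mappers: Mapper, GroupMapping, RepeatingGroupMapping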
In all these categories, the meaning and usage of the properties is identical:
- The source property identifies existing data holders that a transformation should use.
- The target property identifies data holders that may or may not already exist. If they exist, the transformation uses them. If they do not exist, the transformation creates them.
After you define the source and/or the target, the subsequent components use the identified data holders. For example, if you define the target of a Group, the anchors nested within the Group use the data holders that the target identifies.
Note: There are properties called source and target in some other components such as Map. These properties have a different meaning and usage from the above. For an explanation, see the components where the properties are used.
Source Property
The source property identifies existing occurrences of data holders. The value of the source property is a list containing one or more of the following components:
Source               Description
Locator              Identifies a single-occurrence or multiple-occurrence data holder. In the latter case, each iteration accesses the next occurrence, in sequence.
LocatorByKey         Identifies an occurrence of a multiple-occurrence data holder by using a key.
LocatorByOccurrence  Identifies an occurrence of a multiple-occurrence data holder by number.
Default Behavior
If you do not assign the source property of a component, the component identifies data holders in the following way:
- If there is only one occurrence of the data holder, the component uses the existing occurrence.
- If there are multiple occurrences of the data holder, the behavior is as follows:
  - In an iterative context, such as within a RepeatingGroupSerializer, each iteration accesses the next occurrence of the data holder in sequence.
  - In a non-iterative context, such as a GroupSerializer that is not nested within an iterative component, the component accesses the first occurrence of the data holder.
You can assign the source property explicitly to resolve potential ambiguities, for example:
- In cases where a multiple-occurrence element is nested within another multiple-occurrence element. For more information, see Example 1: Nested Multiple-Occurrence Data Holders on page 197.
- In cases where the XSD schema permits alternative data holders, defined with xs:choice.
- In cases where the XSD schema permits a data holder to be missing, defined with minOccurs = 0.
You want to iterate over all the Employee elements and produce the following output:
John
Leslie
Pedro
Marie
Larry
Frances
At first thought, you might create a RepeatingGroupSerializer and configure it to output the Employee data holder:
This does not work correctly! By default, each iteration selects a new instance of Employee within the same Company. The result is the output:
John
Leslie
Pedro
In other words, the RepeatingGroupSerializer accesses only the first Company. You can solve the problem by nesting the RepeatingGroupSerializer inside another RepeatingGroupSerializer. To resolve any potential ambiguities, you can configure the source properties explicitly:
Each iteration of the outer RepeatingGroupSerializer processes a different occurrence of Company. Each iteration of the nested RepeatingGroupSerializer processes a different occurrence of Employee. The result is the desired output. Alternatively, suppose you want to iterate only over the second Employee element in each Company. The desired output is:
Leslie
Larry
You can do this by configuring a single RepeatingGroupSerializer, whose source is Company. This causes each iteration to access the next instance of Company. Within the iteration, you can configure a GroupSerializer, whose source property uses a LocatorByOccurrence to select the second Employee. This generates the desired output.
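The three behaviors discussed in this example can be compared in Python. The input document below is a reconstruction: the element names Company and Employee and the employee names come from the text, while the Companies root element and all three function names are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

# Reconstructed input; the root element name is hypothetical.
COMPANIES = """<Companies>
  <Company>
    <Employee>John</Employee>
    <Employee>Leslie</Employee>
    <Employee>Pedro</Employee>
  </Company>
  <Company>
    <Employee>Marie</Employee>
    <Employee>Larry</Employee>
    <Employee>Frances</Employee>
  </Company>
</Companies>"""

root = ET.fromstring(COMPANIES)


def single_loop(root):
    """One loop over Employee: stays inside the first Company."""
    first_company = root.find("Company")
    return [e.text for e in first_company.findall("Employee")]


def nested_loops(root):
    """Outer loop over Company, inner loop over Employee."""
    return [e.text
            for company in root.findall("Company")
            for e in company.findall("Employee")]


def nth_employee_per_company(root, number):
    """Loop over Company and select one Employee by occurrence number (1-based)."""
    selected = []
    for company in root.findall("Company"):
        employees = company.findall("Employee")
        if len(employees) >= number:
            selected.append(employees[number - 1].text)
    return selected


print(single_loop(root))
print(nested_loops(root))
print(nth_employee_per_company(root, 2))
```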
Example 2: Indexing
In the Example of Indexing by Key at the beginning of this chapter, we used a RepeatingGroupMapping configured as shown below. In this example, the source property identifies two data holders:
- It uses a Locator component to identify an occurrence of Child. Each iteration processes the next occurrence of Child, sequentially.
- It uses a LocatorByKey component to identify an occurrence of Parent. This causes each iteration to process the occurrence of Parent that corresponds to the occurrence of Child.
Target Property
The target property identifies an occurrence of a data holder that may or may not already exist. If the occurrence exists, the component uses it. If the occurrence does not exist, the component creates it. The value of the target property is a list containing one or more of the following components:
Target               Description
Locator              Identifies a single-occurrence or multiple-occurrence data holder. In the latter case, each iteration creates a new occurrence.
LocatorByKey         Identifies an occurrence of a multiple-occurrence data holder by an indexing key. If the occurrence does not yet exist, it is created.
LocatorByOccurrence  Identifies an occurrence of a multiple-occurrence data holder by number. If the occurrence does not yet exist, it is created along with any needed intervening occurrences. For example, if four occurrences exist, and LocatorByOccurrence specifies the tenth occurrence, occurrences 5-9 are also created, but left empty.
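The creation of intervening occurrences can be pictured with a Python list. In this sketch, which is an analogy rather than the product's implementation, an empty dictionary stands for an empty occurrence, and the function name locate_by_occurrence is hypothetical.

```python
def locate_by_occurrence(occurrences, number, make_empty=dict):
    """Return occurrence `number` (1-based), creating empty occurrences as needed."""
    while len(occurrences) < number:
        occurrences.append(make_empty())
    return occurrences[number - 1]


# Four occurrences already exist.
items = [{"value": n} for n in range(1, 5)]

# Selecting the tenth occurrence also creates occurrences 5-9, left empty.
tenth = locate_by_occurrence(items, 10)
tenth["value"] = "ten"

print(len(items))
print(items[4:9])
```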
Default Behavior
If you do not assign the target property of a component, the component identifies data holders in the following way:
- If the schema permits only a single occurrence of the data holder, Data Transformation accesses or creates the occurrence.
- If the data holder can have multiple occurrences, the behavior is as follows:
  - In an iterative context, for example, within a RepeatingGroup, each iteration creates a new occurrence of the data holder.
  - In a non-iterative context, for example, a Group that is not nested within an iterative component, the component creates one new occurrence of the data holder.
You can assign the target property explicitly to resolve potential ambiguities, for example:
- In cases where a multiple-occurrence element is nested within another multiple-occurrence element. For more information, see Example 1: Nested Multiple-Occurrence Data Holders on page 200.
- In cases where the XSD schema permits alternative data holders, defined with xs:choice.
- In cases where the XSD schema permits a data holder to be missing, defined with minOccurs = 0.
Example 2: Indexing
The Example of Indexing by Key, at the start of this chapter, illustrates how to use the target property with indexing. The target property of the RepeatingGroupMapping is configured as follows:
The target property identifies two data holders:
- It uses a Locator component to identify an occurrence of Person. Each iteration creates a new occurrence of Person.
- It uses a LocatorByKey component to identify the occurrence of the Hobby element, where the occurrence of Person should be nested. If the Hobby element already exists, the transformation uses it. If the Hobby element does not yet exist, the transformation creates it.
Standard Locator and Key Properties
Property  Description
disabled  If selected, Data Transformation ignores the component. This is useful for testing and debugging, or for making minor modifications in a project without deleting the existing components.
optional  By default, if a component fails, its parent component fails. If you select the optional property, the parent component does not fail.
remark    A comment describing the component.
Component             Description
Key                   Defines a unique identifier for a data holder.
Locator               Identifies a single-occurrence or multiple-occurrence data holder.
LocatorByKey          Identifies an occurrence of a multiple-occurrence data holder by using a key.
LocatorByOccurrence   Identifies an occurrence of a multiple-occurrence data holder by number.
Key
A Key defines attributes or elements that serve as a unique identifier of their parent element.
How to Define
You can define a Key only at the global level of the IntelliScript. This allows you to reference the Key anywhere in the project. The name of a Key is case-sensitive.
Example
The Example of Indexing by Key defines a key for the Hobby element in the following structure:
<Hobbies>
  <Hobby name="Swimming">
    <Person firstName="Eric" lastName="Smith"/>
    <Person firstName="Edward" lastName="Doe"/>
  </Hobby>
  <Hobby name="Biking">
    <Person firstName="Elizabeth" lastName="Doe"/>
  </Hobby>
  <Hobby name="Painting">
    <Person firstName="Mary" lastName="Smith"/>
  </Hobby>
</Hobbies>
The key is the name attribute, which uniquely identifies each Hobby.
Composite Keys
Optionally, you can define a list of data holders as a composite key. To do this, nest multiple data holders under the unique_fields property. Consider the following example:
<Persons>
  <Person ID="17" SubID="A">Bob</Person>
  <Person ID="17" SubID="B">Jane</Person>
  <Person ID="35" SubID="A">Larry</Person>
</Persons>
Neither the ID attribute nor the SubID attribute identifies a Person element uniquely. The combination of ID and SubID, however, is a unique identifier. You can define ID and SubID as a composite key.
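The uniqueness argument above can be checked mechanically. The following Python sketch is an illustration only and is not part of Data Transformation. The helper name is_unique_key is hypothetical. It tests whether a list of attributes identifies each Person element in the sample uniquely.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample document taken from the composite key example.
PERSONS_XML = """<Persons>
  <Person ID="17" SubID="A">Bob</Person>
  <Person ID="17" SubID="B">Jane</Person>
  <Person ID="35" SubID="A">Larry</Person>
</Persons>"""


def is_unique_key(root, fields):
    """Return True if the combined values of `fields` differ for every Person.

    Hypothetical helper for illustration; it is not a Data Transformation API.
    """
    counts = Counter(
        tuple(person.get(field) for field in fields)
        for person in root.findall("Person")
    )
    return all(count == 1 for count in counts.values())


root = ET.fromstring(PERSONS_XML)
id_only = is_unique_key(root, ["ID"])               # ID 17 appears twice
subid_only = is_unique_key(root, ["SubID"])         # SubID A appears twice
composite = is_unique_key(root, ["ID", "SubID"])    # every pair is distinct
print(id_only, subid_only, composite)
```

Only the combination of both attributes passes the check, which is why the two attributes are defined together as a composite key.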
For example, this means that Persons/Person/SocialSecurity/@Number can be a valid key for Persons/Person, because @Number is nested within Persons/Person. On the other hand, Persons/Child is not a valid key for Persons/Person because it is not correctly nested.
The unique_fields must identify the closest ancestor that can have multiple occurrences. For example, if both Parent and Child are multiple-occurrence elements, then Parent/Child/@name can be a valid key for Parent/Child but not for Parent.
The unique_fields must have simple data types. They cannot be structures.
<Employee ID="1">John</Employee> <Employee ID="2">Leslie</Employee>
<Employee ID="1">Marie</Employee> <Employee ID="2">Larry</Employee>
The ID attribute can be a valid key for Employee because it uniquely identifies an Employee within a single Company. The duplication of ID values in different Company elements does not invalidate the key.
If two or more sibling occurrences of an input element have the same key values, Data Transformation considers each occurrence to overwrite the previous occurrences. It uses only the last occurrence that it encounters. If an occurrence of an input element is missing a key value, the occurrence is ignored. If Data Transformation outputs a keyed element, and a sibling element having the same key value already exists, the existing occurrence is overwritten.
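The duplicate-key and missing-key rules above can be modeled with an ordinary dictionary. The following Python sketch only illustrates the stated rules and does not represent how Data Transformation is implemented. The function name index_by_key and the sample records are hypothetical.

```python
def index_by_key(occurrences, key_fields):
    """Index sibling occurrences by key, following the rules stated above.

    - An occurrence that is missing a key value is ignored.
    - A later occurrence with the same key values overwrites an earlier one.
    Hypothetical helper for illustration only.
    """
    indexed = {}
    for occurrence in occurrences:
        key = tuple(occurrence.get(field) for field in key_fields)
        if any(value is None for value in key):
            continue  # missing key value: the occurrence is ignored
        indexed[key] = occurrence  # same key: the last occurrence wins
    return indexed


# Hypothetical sibling occurrences within a single parent element.
siblings = [
    {"ID": "1", "name": "John"},
    {"ID": "2", "name": "Leslie"},
    {"ID": "1", "name": "Marie"},   # duplicate key, overwrites John
    {"name": "Larry"},              # no ID, ignored
]
result = index_by_key(siblings, ["ID"])
print(sorted((key, value["name"]) for key, value in result.items()))
```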
The symbol
means that the name attribute has been defined as one of the unique_fields. The symbol
Description A multiple-occurrence element whose occurrences are identified by the key. The key.
Description For more information about these properties, see Standard Serializer Properties on page 174.
remark
Locator
This component is used in the source and target properties to identify a data holder. You can use it to identify either a single-occurrence or multiple-occurrence data holder. In the latter case, each iteration of the component that uses the Locator processes the next occurrence of the data holder.
Table 13-3. Basic Properties
Property
data_holder
Description The data holder that the component identifies. Select from a Schema view.
Description For more information about these properties, see Standard Serializer Properties on page 174.
LocatorByKey
This component is used in the source and target properties to identify an occurrence of a multiple-occurrence data holder. Before you use this component, you must define a Key at the global level of the IntelliScript. The Key specifies the data holders that uniquely identify the occurrence. In the LocatorByKey configuration, you must specify:
The key that you wish to use. The values of the key fields. You can specify the values either statically, by typing a value, or dynamically, by selecting a data holder that contains the value.
Description From a Schema view, select the XPath predicate representation of the key. For example, if you have defined Hobbies/Hobby/@name as a Key, then you can select Hobbies/Hobby[@name=$1]. Under this property, specify the values of the parameters in the XPath predicate ($1, $2, and so forth). Type each value, or click the Browse button and select a data holder that contains the value.
params
Description For more information about these properties, see Standard Serializer Properties on page 174.
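To make the XPath predicate notation concrete, the following Python sketch looks up a Hobby by its name key with the standard xml.etree.ElementTree module. It is an analogy under stated assumptions, not the LocatorByKey implementation. The function locate_by_key and its parameter handling are hypothetical. ElementTree paths are relative to the root element, so the leading Hobbies/ step is omitted.

```python
import xml.etree.ElementTree as ET

HOBBIES_XML = """<Hobbies>
  <Hobby name="Swimming">
    <Person firstName="Eric" lastName="Smith"/>
    <Person firstName="Edward" lastName="Doe"/>
  </Hobby>
  <Hobby name="Biking">
    <Person firstName="Elizabeth" lastName="Doe"/>
  </Hobby>
</Hobbies>"""


def locate_by_key(root, template, params):
    """Fill $1, $2, ... in a predicate template and return the first match.

    Hypothetical helper: values are inserted as quoted literals, so this
    sketch does not handle values that themselves contain quotes.
    """
    path = template
    # Substitute higher-numbered parameters first so $1 cannot clobber $10.
    for number in range(len(params), 0, -1):
        path = path.replace(f"${number}", f"'{params[number - 1]}'")
    return root.find(path)


root = ET.fromstring(HOBBIES_XML)
biking = locate_by_key(root, "Hobby[@name=$1]", ["Biking"])
people = [person.get("firstName") for person in biking.findall("Person")]
missing = locate_by_key(root, "Hobby[@name=$1]", ["Chess"])
print(people, missing)
```

In this sketch a key value that matches no occurrence returns None. In a target property, by contrast, the text above states that the transformation creates the missing occurrence.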
LocatorByOccurrence
This component is used in the source property to identify an occurrence of a multiple-occurrence data holder, such as an element that can occur multiple times in an XML document or a variable that can occur multiple times. The component identifies the occurrence by number. For example, if there are ten occurrences of a data holder, you can use LocatorByOccurrence to process the third occurrence. LocatorByOccurrence can be used to iterate over the occurrences in a repeating structure such as a RepeatingGroup anchor. You can specify the occurrence number either statically, by entering a number, or dynamically, by selecting a data holder that contains the number.
Description The data holder that the component identifies. The number of the occurrence. Type a number, or click the Browse button and select a data holder that contains the number.
Description For more information about these properties, see Standard Serializer Properties on page 174.
optional remark
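As a rough analogy for identifying an occurrence by number, the following Python sketch selects the third of ten sibling elements. It assumes that occurrence numbers start at 1, which the text implies but does not state. The function locate_by_occurrence is hypothetical and is not a Data Transformation API.

```python
import xml.etree.ElementTree as ET

# Ten hypothetical occurrences of a multiple-occurrence element.
persons_xml = (
    "<Persons>"
    + "".join(f"<Person>P{i}</Person>" for i in range(1, 11))
    + "</Persons>"
)
root = ET.fromstring(persons_xml)


def locate_by_occurrence(parent, tag, number):
    """Return the occurrence with the given 1-based number, or None.

    Assumption: numbering is 1-based. Hypothetical helper for illustration.
    """
    matches = parent.findall(tag)
    if 1 <= number <= len(matches):
        return matches[number - 1]
    return None


third = locate_by_occurrence(root, "Person", 3)
beyond = locate_by_occurrence(root, "Person", 11)
print(third.text, beyond)
```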
CHAPTER 14
Streamers
This chapter includes the following topics:
How a Streamer Works, 207
Creating a Streamer, 210
Streamer Quick Reference, 212
Streamer Component Reference, 212
The transformation parses each source segment as soon as it is available, rather than waiting until the entire source is received. The transformation has reduced memory requirements.
For example, suppose that an input stream contains stock market transaction data. The stream is transmitted to a server continuously over the course of the entire trading day. A streamer enables Data Transformation to process each transaction as soon as it arrives, rather than waiting until the end of the day. In another example, suppose that you receive a large source file over an FTP connection. By using a streamer, Data Transformation can start processing the file before it is completely received. Streamers are runnable components. The Streamer component is defined at the top-level of the IntelliScript, and it must be set as the startup component of the transformation. It functions by splitting its input into segments and passing them to other runnable components, which can be parsers, mappers, or serializers.
Segments
A streamer identifies segments of its input. It passes the segments individually to parsers, mappers, or serializers, which transform the segment data. A streamer assumes that the source is composed of:
For each type of segment, the streamer defines a parser, mapper, or serializer that processes the segment. The repeating segments can be either simple or complex. A simple segment is a single unit of data. A complex segment has its own nested header, repeating segments, and footer. Headers and footers are always simple segments.
Simple Segments
A simple segment has an opening marker that identifies where it starts, and a closing marker that identifies where it ends. Thus, a simple segment has the following structure:
Opening marker Data Closing marker
The streamer passes the segment to the specified transformation component, such as a parser. It is possible to omit some of the markers from the streamer definition. For example:
If you omit the opening marker of the source header, the header is assumed to start at the beginning of the source. If you omit the closing marker, then the segment ends at the opening marker of the next segment.
Complex Segments
A complex segment has a header and footer. Between the header and footer, it can contain any number of nested simple segments, for example:
Header Simple segment Simple segment Simple segment Footer
You can also define a complex segment that is missing the header or footer, for example:
Simple segment Simple segment Simple segment
The nested simple segments must all be of the same type. That is, they must all be identified by the same opening and closing markers.
Example
A data stream contains stock transaction data. The stream has the following structure:
The header begins with the string yy-MM-dd/, which is a date followed by a slash. The header contains various data, followed by the string ENDHEAD/.
The repeating segments begin with the string TRANS HH:mm nnn/, where HH:mm is the time on a 24-hour clock, and nnn is a serial number of any length.
The data stream ends with the string END/.
The following is a sample data stream conforming to this specification, where ... represents arbitrary data that must be parsed:
You can parse this stream by using a streamer having the following schematic structure. Notice that the opening and closing markers are located by searching for a particular pattern or string.
Segment     Type     Opening Marker                          Closing Marker
Header      Simple   [0-9][0-9]-[0-9][0-9]-[0-9][0-9]/       ENDHEAD/
Repeating   Simple   TRANS [0-9][0-9]:[0-9][0-9] [0-9]+/     none
Footer      Simple   END/                                    none
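The marker patterns listed above can be tried out with ordinary regular expressions. The following Python sketch is a simplified illustration and is not the Streamer implementation: it reads the whole source into memory, it assumes that every marker is present, and it keeps each opening marker inside its segment. The sample stream and the function split_stream are hypothetical.

```python
import re

# Marker patterns copied from the example above.
HEADER_OPEN = re.compile(r"[0-9][0-9]-[0-9][0-9]-[0-9][0-9]/")
HEADER_CLOSE = re.compile(r"ENDHEAD/")
REPEAT_OPEN = re.compile(r"TRANS [0-9][0-9]:[0-9][0-9] [0-9]+/")
FOOTER_OPEN = re.compile(r"END/")

# Hypothetical stream that conforms to the described structure.
SAMPLE = "08-11-03/open=1ENDHEAD/TRANS 09:30 1/BUY 100TRANS 09:31 22/SELL 50END/"


def split_stream(source):
    """Split a source into (header, repeating_segments, footer).

    Assumes a well-formed source; a missing marker raises AttributeError.
    """
    h_open = HEADER_OPEN.search(source)
    h_close = HEADER_CLOSE.search(source, h_open.end())
    header = source[h_open.start():h_close.end()]

    footer_open = FOOTER_OPEN.search(source, h_close.end())
    body = source[h_close.end():footer_open.start()]

    # A repeating segment has no closing marker, so it ends where the
    # next segment's opening marker begins.
    starts = [match.start() for match in REPEAT_OPEN.finditer(body)]
    ends = starts[1:] + [len(body)]
    segments = [body[start:end] for start, end in zip(starts, ends)]

    footer = source[footer_open.start():]
    return header, segments, footer


header, segments, footer = split_stream(SAMPLE)
print(header)
print(segments)
print(footer)
```

A real streamer works incrementally on data as it arrives. The sketch only shows how the opening and closing markers delimit the segments.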
Header Concatenation
Optionally, you can configure a streamer to concatenate the header segment with each of the repeating segments. The streamer passes the concatenated result to a parser, mapper, or serializer. For example, suppose that a streamer passes the repeating segment to a parser. The source has the structure
Header Segment1 Segment2 Segment3
where Segment1, and so forth, are instances of the repeating segment. If you select the concatenation option, the streamer sends the following data to the parser:
HeaderSegment1 HeaderSegment2 HeaderSegment3
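The effect of the concatenation option can be expressed in a few lines of Python. This is a minimal sketch of the behavior described above, not the product implementation. The function name concat_header is hypothetical.

```python
def concat_header(header, repeating_segments):
    """Prefix each repeating segment with the header, as described above."""
    return [header + segment for segment in repeating_segments]


# Each element is what the parser would receive for one repeating segment.
passed_to_parser = concat_header("Header", ["Segment1", "Segment2", "Segment3"])
print(passed_to_parser)
```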
Output of a Streamer
A streamer generates an independent output document for each of the source segments.
This output is not well-formed XML because it contains multiple document elements.
<header>...</header>
<?xml version="1.0" encoding="windows-1252"?>
<repeating_segment>...</repeating_segment>
<?xml version="1.0" encoding="windows-1252"?>
<repeating_segment>...</repeating_segment>
<?xml version="1.0" encoding="windows-1252"?>
<footer>...</footer>
</MyRoot>
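The well-formedness problem can be reproduced with any standard XML parser. The following Python sketch joins several hypothetical per-segment outputs and shows that the result parses only after it is wrapped in a single root element. The per-segment XML declarations are omitted to keep the sketch focused on the root element. The element contents and the function is_well_formed are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical outputs, one independent document per source segment.
segment_outputs = [
    "<header>h</header>",
    "<repeating_segment>1</repeating_segment>",
    "<repeating_segment>2</repeating_segment>",
    "<footer>f</footer>",
]


def is_well_formed(text):
    """Return True if `text` parses as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


combined = "".join(segment_outputs)
wrapped = "<MyRoot>" + combined + "</MyRoot>"

combined_ok = is_well_formed(combined)   # multiple document elements
wrapped_ok = is_well_formed(wrapped)     # single root element
print(combined_ok, wrapped_ok)
```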
Creating a Streamer
To create a streamer:
1. Analyze the source structure and identify the segment types.
2. Create or open a Data Transformation Studio project.
3. In the project, configure a parser, mapper, or serializer that can process each type of simple segment.
4. In the same project, configure a Streamer component.
5. Within the Streamer, nest ComplexSegment and SimpleSegment components corresponding to the source structure.
6. For each SimpleSegment, define the opening marker and closing marker if required.
7. Define the parser, mapper, or serializer that processes the segment.
8. Define the Streamer as the startup component of the project.
9. Run the project on a source document.
10. Open the output file, located in the project's Results folder, to view the output. If the streamer passes the segments to parsers, the output may fail to display because it contains multiple XML document elements. To solve this problem, wrap the output in a root tag. For more information, see Output of a Streamer on page 209.
Examples
We present two streamer examples below to illustrate their configurations.
Example 1
The first example contains simple segments. Each segment has a predefined opening and closing marker.
The streamer passes the header and repeating segments to a parser called body_p. It passes the footer to a parser called foot_p.
Example 2
The following streamer contains a nested, repeating ComplexSegment. The nested ComplexSegment segment has its own header and nested, repeating SimpleSegment. The nested ComplexSegment does not have a footer. Notice that the property concat_header_to_repeating_segment has been selected. The effect of this property is to concatenate the header to each instance of the repeating segment. The streamer passes the concatenated segments to the parser body_p.
Component          Description
ComplexSegment     Defines a source structure having a header, a repeating portion, and a footer.
MarkerStreamer     Defines the start and end of simple segments.
SimpleSegment      Defines a source unit having an opening marker and a closing marker. Specifies the transformation that processes the unit.
Streamer           Splits a large source into segments for separate processing.
StreamerVariable   A user-defined variable whose scope includes all segments of a streamer.
ComplexSegment
A ComplexSegment defines a source structure having a header, a repeating portion, and a footer.
Table 14-1. Basic Properties
Property            Description
header_segment      The header portion of the source. Within this property, you can nest a SimpleSegment that defines the header. If you do not assign the property, the source is assumed not to contain a header.
repeating_segment   The repeating portion of the source. Within this property, you can nest a SimpleSegment that defines the repeating data. You can also nest a ComplexSegment that has its own header-repeating-footer structure.
footer_segment      The footer portion of the source. Within this property, you can nest a SimpleSegment that defines the footer. If you do not assign the property, the source is assumed not to contain a footer.
Description If selected, the system concatenates the header_segment to each instance of the repeating_segment. It passes the result of the concatenation to the run_component of the repeating_segment. For more information, see Header Concatenation on page 209.
MarkerStreamer
A MarkerStreamer defines the opening and closing markers of simple segments. It is similar to a regular Marker anchor, but it is used only in streamers. For more information about how Data Transformation searches for markers, see Anchors on page 71.
Table 14-3. Basic Properties
Property
search
Description The way in which the MarkerStreamer finds text. The options are: - TextSearch. Searches for an explicit string. - PatternSearch. Searches for a regular expression. - OffsetSearch. Skips a predefined number of characters following the preceding reference point. - NewlineSearch. Searches for a newline character.
Description If selected, the MarkerStreamer must be adjacent to the end of the preceding segment. This is useful to ensure that the segments are not separated by any other text, including whitespace. Specifies from which occurrence of the marker to begin processing. Use count=3 to skip the first and second occurrences of the marker. Specifies whether the marker should be used as a reference point to identify the succeeding segment or marker. The possible values are: - full. Places a reference point before and after the current marker. - begin position. Before only. - end position. After only. A name for the marker, displayed in the event log. A description of the marker. If selected, the marker is disabled and not used.
count
marking
If the opening marker has marking = begin position, the innermost reference point is at the start. The entire marker is included in the segment.
If the opening marker has marking = end position or full, the innermost reference point is at the end. The marker is excluded from the segment.
The inverse relationships apply to the closing marker. To illustrate this, consider a simple segment having the following structure:
BEGIN...data...END
A MarkerStreamer identifies the opening marker by searching for the text BEGIN. Another MarkerStreamer identifies the closing marker by searching for END. The following table illustrates how the marking property affects the segment boundaries.
Marking of Opening Marker   Marking of Closing Marker   Segment Passed to the Transformation
full                        full                        ...data...
full                        begin                       ...data...
full                        end
begin                       full
begin                       begin
begin                       end
end                         full
end                         begin
end                         end
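The boundary rule for the BEGIN...data...END example can be sketched in Python. The sketch relies on an interpretation that the text does not spell out: for the closing marker, a marking of end position is assumed to include the marker in the segment, and a marking of begin position or full is assumed to exclude it. This reading is consistent with the two results that the table shows, but the remaining results computed here are derived, not quoted. The function segment_passed is hypothetical.

```python
def segment_passed(text, open_text, close_text, open_marking, close_marking):
    """Return the segment that the rule above would pass on.

    Markings are "full", "begin", or "end". The closing-marker behavior
    is an assumed reading of "the inverse relationships apply".
    """
    open_at = text.index(open_text)
    close_at = text.index(close_text, open_at + len(open_text))

    # Opening marker: "begin" keeps the marker, "end" or "full" drops it.
    start = open_at if open_marking == "begin" else open_at + len(open_text)
    # Closing marker (assumed inverse): "end" keeps the marker,
    # "begin" or "full" drops it.
    end = close_at + len(close_text) if close_marking == "end" else close_at
    return text[start:end]


source = "BEGIN...data...END"
for opening in ("full", "begin", "end"):
    for closing in ("full", "begin", "end"):
        print(opening, closing, segment_passed(source, "BEGIN", "END", opening, closing))
```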
SimpleSegment
A SimpleSegment defines a data unit having an opening marker and a closing marker. It defines the parser, mapper, or serializer that should process the segment. The markers are defined by using regular expressions. For more information about the regular expression syntax, see RegularExpression on page 125.
Table 14-5. Basic Properties
Property         Description
opening_marker   A regular expression identifying the segment start. If omitted, the segment is assumed to start at the beginning of the source or at the end of the preceding segment.
closing_marker   A regular expression identifying the segment end. If omitted, the segment is assumed to end at the end of the source or at the start of the next segment.
run_component    A parser, mapper, or serializer that Data Transformation should use to process the data in the segment.
Streamer
The Streamer component splits its input into segments, and it passes each type of segment to a predefined parser, mapper, or serializer. The Streamer must be defined at the top-level of the IntelliScript, and it must be the startup component of the transformation. Within a Streamer, you must nest a ComplexSegment. The ComplexSegment, in turn, can contain nested SimpleSegment or ComplexSegment components.
Table 14-6. Basic Properties
Property
contains
Description A complex segment that defines the overall structure of the source.
Description If selected, Data Transformation ignores the component. This is useful for testing and debugging, or for making minor modifications in a project without deleting the existing components. A name that you assign to the component. Data Transformation includes the name in the event log. This can help you find an event that was caused by the particular component. A comment describing the component.
name
remark
Description An XML tag in which the Streamer wraps the combined output from all the segments. For more information, see Output of a Streamer on page 209. The maximum quantity of new data, in kilobytes, that the Streamer searches for each new segment. For optimal performance, set this property to approximately twice the maximum possible segment size. The default is 10,000 kilobytes. When an API or integration application activates a deployed Streamer service, it must set the chunk size parameter to a value that is smaller than the max_lookup_size.
max_lookup_size
StreamerVariable
A StreamerVariable is a user-defined variable whose scope includes all segments of a Streamer. For example, if a streamer contains three parsers, the value of a StreamerVariable is available to all three parsers. A parser that processes a header segment might retrieve data from the header and store it in a StreamerVariable. Another parser, which processes the repeating segment, can then access the value of the StreamerVariable. You cannot use a regular Variable for this purpose because the value of the variable is not shared between segments. In other respects, the StreamerVariable component is similar to a regular Variable. However, a StreamerVariable must have a simple, single-occurrence data type. For more information, see Variables on page 64. You can define a StreamerVariable only at the top level of the IntelliScript.
Table 14-8. Basic Properties
Property
val_type
Description The XSD data type that the variable can store. Assign a simple type such as xs:string or xs:integer. Streamer variables do not support complex or multiple-occurrence types.
Description An initial value for the StreamerVariable, assigned when the transformation starts. Select InitialValue and enter the value.
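The scoping idea can be pictured as a small store of simple values that outlives any single segment. The following Python sketch is a conceptual model only. The class StreamerVariables, the two parser functions, and the sample segments are hypothetical and do not correspond to IntelliScript syntax.

```python
class StreamerVariables:
    """Hypothetical store of simple values shared by all segments of one run."""

    def __init__(self, **initial_values):
        # Initial values are assigned once, when the transformation starts.
        self._values = dict(initial_values)

    def set(self, name, value):
        self._values[name] = value

    def get(self, name):
        return self._values[name]


def parse_header(segment, variables):
    """Header parser: store the date that precedes the first slash."""
    variables.set("trade_date", segment.split("/", 1)[0])


def parse_transaction(segment, variables):
    """Repeating-segment parser: read the value stored by the header parser."""
    return {"date": variables.get("trade_date"), "body": segment}


variables = StreamerVariables(trade_date="")
parse_header("08-11-03/open=1ENDHEAD/", variables)
records = [
    parse_transaction(segment, variables)
    for segment in ["TRANS 09:30 1/BUY 100", "TRANS 09:31 22/SELL 50"]
]
print(records)
```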
CHAPTER 15
Project Properties
This chapter includes the following topics:
Overview
The project properties are options that you can set for the behavior of a project. They control essential features of the project such as the input and output encodings, the authentication support, and the XML validation. The project properties are saved with the project. They affect the behavior in all circumstances where you run the project:
In the Data Transformation Studio environment
When you deploy the project as a Data Transformation service and run it in Data Transformation Engine
For many projects, you can accept the default values of the project properties. Nevertheless, before you deploy a project as a service, always review the project properties and confirm that the settings meet your needs.
To set the project properties: 1. 2.
Select the project in the Data Transformation Explorer. Click File > Properties. Open a TGP script file belonging to the project in an IntelliScript editor. Click Project > Properties.
The preferences affect the display in Data Transformation Studio. They apply to all projects equally. The project properties affect the operation of a transformation both in Data Transformation Studio and in Data Transformation Engine. You can set the properties independently for each project.
Property Pages
The properties window organizes the properties in several pages. The following sections describe the properties on each page.
Info Properties
The Info page of the project properties displays general information, such as the storage location of the project.
Authentication Properties
Note: The authentication properties are supported for compatibility with projects created in earlier Data Transformation versions. They are being phased out of the Data Transformation system. Do not use them in new projects.
If the project accesses a location that requires a login, you can store the login information on the Authentication page of the project properties. This feature is useful, for example, if a parser processes source documents that are located on a password-protected web site. The options are as follows:
Option                    Description
Enable authentication     Select this option if the remote location requires a login.
Prompt before execution   When a login is required, Data Transformation prompts the user to enter a user name and password.
Save in project           When a login is required, the project automatically submits the user name and password that are specified in the login information.
Login information         The user name and password.
Encoding Properties
The Encoding page of the project properties lets you specify how the input, output, and working files of a project are encoded.
Supported Encodings
Data Transformation supports the following encodings:
Encoding       Description
Big5           Chinese
Big5-HKSCS     Chinese with Hong Kong Supplementary Character Set
EBCDIC-37      US/Canada
EBCDIC-284     Spanish
EBCDIC-424     Hebrew
EUC-KR         Korean
GB2312         Chinese
GB18030        Chinese
ISO-8859-1     Latin-1 (English and West European)
ISO-8859-2     Latin-2 (East European)
ISO-8859-3     Latin-3 (South European)
ISO-8859-4     Latin-4 (North European)
ISO-8859-5     Cyrillic
ISO-8859-6     Arabic
ISO-8859-7     Greek
ISO-8859-8     Hebrew
ISO-8859-9     Latin-5 (Turkish)
ISO-8859-15    Latin-9
KSC_5601       Korean
Shift_JIS      Japanese
TIS-620        Thai
UTF-16         Unicode
UTF-16BE       Unicode
UTF-7          Unicode
UTF-8          Unicode
Windows-874    Thai
Windows-1250   Central European
Windows-1251   Cyrillic
Windows-1252   ANSI English and West European
Windows-1253   Greek
Windows-1254   Turkish
Windows-1255   Hebrew
Windows-1256   Arabic
Windows-1257   Baltic
Windows-1258   Vietnamese
The proprietary Hebrew BaseCodePage continues to be supported in projects that were upgraded from previous Data Transformation versions. In new projects, use one of the other Hebrew code pages. Additional encodings may be supported. For an up-to-date list, select one of the Custom options on the Encoding page and open the drop-down list.
Input Encoding
The Input area of the Encoding page specifies how the source document of a transformation is encoded.
Option Extract code page from source Description If selected, Data Transformation uses a code page that is specified in the source document, for example, in the encoding attribute of an XML document. If Data Transformation does not find an encoding specification in the document, it uses the encoding defined in the settings described below. If selected, Data Transformation assumes that the input has the same encoding as the working files of the project, as defined in the Working area of the properties page. Select the encoding from the list.
Description The encoding of special characters: none or XML. In the XML encoding schema, symbols such as < or > are represented as entities, such as &lt; and &gt;. Serializers and mappers ignore this option. A serializer or mapper always assumes that its input uses the XML encoding schema. The byte order of binary data. The options are Little Endian, Big Endian, or no binary conversion. The default is Little Endian, which is appropriate for most data on the Windows operating system.
Byte order
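Two of the ideas above, entity encoding of special characters and byte order, can be illustrated with the Python standard library. The sketch is independent of Data Transformation and only demonstrates the general concepts. The sample string and the integer value are arbitrary.

```python
import struct
from xml.sax.saxutils import escape, unescape

# XML encoding schema: special characters are represented as entities.
raw = "a < b & b > c"
encoded = escape(raw)        # entity-encoded form
decoded = unescape(encoded)  # back to the raw characters

# Byte order: the same 32-bit integer laid out in both orders.
little_endian = struct.pack("<I", 1)
big_endian = struct.pack(">I", 1)

print(encoded)
print(little_endian.hex(), big_endian.hex())
```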
Working Encoding
The Working area of the Encoding page specifies the encoding of the project's working files, including the TGP script files and the IntelliScript.
Option                                     Description
Use Data Transformation default codepage   Uses the system default encoding.
Custom                                     Select the encoding from the list.
Working encoding schema                    The encoding of special characters: none or XML, as for the input encoding.
You must select a working encoding that is compatible with the encoding of your XSD schema. For more information, see Encoding of the XSD Schema on page 57.
Output Encoding
The Output area of the Encoding page defines the encoding of the project output.
Option                 Description
Use working encoding   Use the same encoding as for the working files.
Same as input          Use the same encoding as for the input.
Custom                 Select the encoding from the list.
Encoding schema        The encoding of special characters: none or XML, as for the input encoding. Parsers and mappers ignore this option. A parser or mapper always encodes its output using the XML encoding schema.
Byte order             The byte order of binary data. The options are Little Endian, Big Endian, or no binary conversion. The default is Little Endian, which is appropriate for most data on the Windows operating system.
The input encoding should be the native encoding of the source document. The output encoding should be the required encoding of the output document.
In most cases, the working encoding should be identical to the input or output encoding. Try setting it to the same value as the output encoding. If that does not work, try setting it to the input encoding.
The working encoding must be able to represent the language of the source document. Otherwise, it may be difficult to define components such as Marker anchors.
For the input and output encoding, you can use UTF-8, another Unicode encoding, or a double-byte code page such as Big5 or Shift_JIS. You can use the Binary Source view of the Studio to display the multiple bytes comprising a single character. For example, if you select a two-byte Chinese character in the example source, the Binary Source view highlights both bytes. In some languages, certain character combinations are written as a single symbol. In Hebrew, for example, consonants can be decorated with diacritical marks representing vowels. The consonant and the diacritic, although written as a single symbol, are actually two different UTF-8 characters. Data Transformation processes them as two characters. Certain symbols or strings look the same but represent different UTF-8 characters. For example, the trademark symbol is a single character. It is not the same as the two-character string TM. To process binary data, select a single-byte working encoding. Do not use UTF-8. Transformations might run more slowly if defined with the UTF-8 working encoding. For best performance, if the data is single-byte, use a single-byte working encoding.
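The character-counting points above are easy to verify in Python, which counts Unicode code points. The sketch uses the Hebrew letter bet with the vowel point qamats as an arbitrary example of a consonant with a diacritic. It is independent of Data Transformation.

```python
import unicodedata

# One written symbol, two code points: a consonant plus a vowel point.
bet_qamats = "\u05d1\u05b8"
symbol_length = len(bet_qamats)
names = [unicodedata.name(ch) for ch in bet_qamats]

# The trademark symbol is a single character, unlike the string "TM".
trademark = "\u2122"
same_as_tm = trademark == "TM"
trademark_utf8_bytes = len(trademark.encode("utf-8"))

# A Chinese character occupies two bytes in the Big5 double-byte code page.
big5_bytes = len("\u4e2d".encode("big5"))

print(symbol_length, names)
print(len(trademark), same_as_tm, trademark_utf8_bytes, big5_bytes)
```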
Encoding Example
Suppose that you want to parse the following source document. The values of First Name, Last Name, and Gender use the Hebrew alphabet. The document has the Windows-1255 (Hebrew) encoding.
The desired output is an XML file containing English tag names and Hebrew data. The output is required in the UTF-8 encoding.
You can configure the parser project with the following encoding:
Input encoding = Windows-1255 (Hebrew)
Working encoding = Windows-1255 (Hebrew)
Output encoding = UTF-8
In the IntelliScript, you can configure anchors by using either English or Hebrew search text.
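The input and output settings in this example correspond to a decode step followed by an encode step. The following Python sketch shows that conversion in isolation. It is not how Data Transformation performs the work internally, and the Hebrew sample value and the FirstName tag are hypothetical.

```python
# Hypothetical Hebrew value, written with escapes so that the sketch does
# not depend on the encoding of this file.
hebrew_name = "\u05e9\u05dc\u05d5\u05dd"

# Simulate a source document stored in the Windows-1255 code page.
source_bytes = hebrew_name.encode("windows-1255")

# Input encoding: decode the source bytes as Windows-1255.
text = source_bytes.decode("windows-1255")

# Output encoding: English tag names, Hebrew data, serialized as UTF-8.
output_xml = f'<?xml version="1.0" encoding="UTF-8"?><FirstName>{text}</FirstName>'
output_bytes = output_xml.encode("utf-8")

print(len(source_bytes), len(text.encode("utf-8")))
```

Each Hebrew letter occupies one byte in Windows-1255 and two bytes in UTF-8, so the byte length of the data doubles while the text stays the same.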
Namespaces Properties
The Namespaces page of the project properties is used to configure XML namespaces. You must define the namespaces in the targetNamespace attribute of the XSD schemas. In the project properties, you can edit only the namespace alias. For more information, see Data Holders on page 55.
When you display the XML in a browser such as Internet Explorer, the browser applies the stylesheet. Create event log By default, Data Transformation Studio generates event logs for the project. You can click the Advanced button and define the events to include in the log. If you deselect the option to create event logs, Data Transformation Studio does not generate an event log. The Events view displays only minimal information, such as the service initialization and termination. This property has no effect when you run a service in Data Transformation Engine. For more information about the Engine event logs, see the Data Transformation Engine Developer Guide. Specifies whether Data Transformation should save a copy of the parsed documents with the event log. The event log uses the copy to display the source of an event. Adds a binary byte-order mark at the start of the output file. Some Unicode applications use the mark to identify the encoding. By default, a transformation writes the output that it generates to a results file. If you select this option, the output is not written unless the parser or serializer runs an action such as WriteValue. The option is useful for debugging. If a transformation has multiple output ports, you must select this option. For more information, see AdditionalOutputPort on page 17. Disables XML output optimizations. The optimizations improve performance, especially when processing large documents. Do not select this option unless advised by an Informatica representative.
Save parsed documents Add binary encoding prefix to output file Disable automatic output
This option specifies what Data Transformation should do if a parser does not map any data to an XML element or attribute that is defined in the XSD schema. The options are:
- Full. Data Transformation attempts to add the missing data holders to the XML output. It assigns the default value, if the schema defines one. If there is no default, it assigns 0 to integer data holders, 0.0 to floating data holders, or an empty value to other data holders.
- Compact. Data Transformation does not add the missing data holders, and it removes empty data holders. A data holder containing the number 0 is not considered empty and is not removed.
- As Is. The XML output contains the data holders that the parser explicitly set. Data Transformation does not add missing data holders, and it does not remove empty values.
These options do not cause a parser to produce invalid XML, provided that you select the Validate Added option. Under certain conditions, however, the Compact or As Is options can cause a parser to output partial or empty XML. For example, suppose that you choose the Compact mode, and the parser does not create a required element. Data Transformation removes its parent element in an attempt to create valid XML. If the parent element is also required, the grandparent is removed. This process continues until Data Transformation reaches an optional element or until the XML is empty.
These options instruct a parser to output XML elements or attributes that are required by the XSD schema. The options override the Compact and As Is output modes. Data Transformation assigns values to the required data holders as in the Full output mode.
If selected, this option causes Data Transformation to validate the elements or attributes that it adds because of the Full or Add Required options. If adding the element or attribute would invalidate the output XML, Data Transformation does not add it. This option is selected by default. Deselecting the option may result in invalid XML.
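The difference between the Full, Compact, and As Is modes can be modeled on a flat set of data holders. The following Python sketch is a deliberate simplification: it ignores nesting, required elements, and validation, and it uses a hypothetical schema and function names. It reflects only the fill and removal rules stated above.

```python
# Hypothetical flat schema: data holder name -> (type, default or None).
SCHEMA = {
    "qty": ("integer", None),
    "price": ("float", None),
    "note": ("string", None),
    "currency": ("string", "USD"),
}


def fill_value(xsd_type, default):
    """Value that Full mode assigns to a missing data holder."""
    if default is not None:
        return default
    if xsd_type == "integer":
        return 0
    if xsd_type == "float":
        return 0.0
    return ""  # empty value for other types


def apply_mode(parsed, mode):
    """Return the data holders that would be output in the given mode."""
    if mode == "full":
        output = dict(parsed)
        for name, (xsd_type, default) in SCHEMA.items():
            output.setdefault(name, fill_value(xsd_type, default))
        return output
    if mode == "compact":
        # Empty values are removed; the number 0 is not considered empty.
        return {name: value for name, value in parsed.items() if value != ""}
    if mode == "as_is":
        return dict(parsed)
    raise ValueError(f"unknown mode: {mode}")


parsed = {"qty": 0, "note": ""}  # what the parser explicitly set
full = apply_mode(parsed, "full")
compact = apply_mode(parsed, "compact")
as_is = apply_mode(parsed, "as_is")
print(full, compact, as_is)
```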
The options under this heading add a processing instruction at the beginning of the output XML, for example:
<?xml version="1.0" encoding="Windows-1252"?>
Select the XML version and the value of the encoding attribute. For the encoding, you can choose the output encoding or a custom encoding designator. For more information, see Encoding Properties on page 218.
Processing instructions
This option adds custom processing instructions to the XML header. Type the processing instructions, including the <? and ?> symbols.
This option lets you wrap the output XML in a tag that is not configured in the IntelliScript and possibly is not defined in the XSD schema. For example, if the output of a parser is <Result>1.0</Result> and you set the root element to OutputWrapper, the project generates the following output:
<OutputWrapper>
  <Result>1.0</Result>
</OutputWrapper>
You must use this option if you run a parser on multiple source documents. For more information, see Running on Additional Source Documents on page 227.
CHAPTER 16
Overview, 225
Color-Coding the Example Source, 225
Running in Data Transformation Studio, 226
Viewing the Event Log, 228
Failure Handling, 231
Disabling a Component, 233
Overview
When you develop a transformation, test and debug it thoroughly before you put it into production. You can use several tools for testing, debugging, and troubleshooting, such as:
- Color-coding a source document in the example pane of the IntelliScript editor
- Running the transformation in Data Transformation Studio and viewing the results
- Viewing the event log that Data Transformation generates for each run
- Cross-identifying an anchor in the example source, in the IntelliScript, and in the Events view
In the learn-example style, the specific anchors that you use to define the document structure are color-coded, for example, the anchors in the first iteration of a repeating group. In the mark-example style, all the anchors that Data Transformation finds in the document are color-coded, for example, all iterations of a repeating group.
By examining the color-coded text, you can confirm that the parser identifies the anchors correctly.
You can choose the following options from the IntelliScript menu or the toolbar to control the color-coding style:
- Learn the Example Automatically. Enables automatic color coding in the learn-example style. When you define anchors in the IntelliScript, the Studio automatically colors the corresponding location in the example.
- Learn Example. Color-codes the anchors in the learn-example style. You can use this command to activate the color coding if you have deselected the option to Learn the Example Automatically. You can also use this command to return to the learn-example style, after you have displayed the mark-example style.
- Mark Example. Runs the parser and color-codes the anchors in the mark-example style.
A further option stops the color-coding operation. If the example source is very long, you can use this option to halt the color display and speed up the response.
Set the startup component of the project. The startup component is a parser, serializer, mapper, global transformer, or streamer that the project should activate. You can set the startup component in any of the following ways:
- In an IntelliScript editor, right-click the component and click Set as Startup Component.
- In the Component view, right-click the component and click Set as Startup Component.
- Click Run > Run and select the component from a list.
2. Define the input of the transformation in either of the following ways:
- Set the example_source property of a parser, serializer, or mapper.
- Set the sources_to_extract property of a parser.
Alternatively:
- Click Run > Run.
- Click the Details button.
- On the I/O ports tab, edit each row, assigning a file to each input port.
If you do not define an input and attempt to run a transformation, the Studio opens the Run window where you can fill in the above information.
3. Optionally, set the initial values of the variables defined in the project.
- Click Run > Run.
- Click the Details button.
- On the Parameters tab, edit each row, assigning a value to each variable.
The initial values that you assign in this way are used when you test the project in the Studio. For this purpose, they override the initialization property of the variables. They have no effect when you later deploy the project as a service. For more information, see Initializing Variables at Runtime on page 66.
4. Run the transformation in either of the following ways:
- Click Run > Run, then click the Run button.
- Click Run > Run StartupComponentName.
The Studio displays the Events view, which informs you of any problems that occurred in the execution. For more information, see Viewing the Event Log on page 228.
5. View the results by double-clicking the output file, in the Results folder of the Data Transformation Explorer.
- Examine the Events view for a description of the problem. For more information, see Viewing the Event Log on page 228.
- Try opening the output file in an external application such as Notepad.
- If the output file is not created at all, examine the Output Control page of the project properties, and confirm that the option to Disable Automatic Output is not selected. For more information, see Output Control Properties on page 222.
Parsers
To test a parser on additional source documents, assign the sources_to_extract property. The value is a single file or a set of files, optionally containing wildcards. For more information, see Parser on page 13. If you select multiple sources, you must select the option to add an XML root element. If you do not do this, the XML that the parser generates is not well formed because it does not have a unique root. For more information, see Project Properties on page 217.
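The need for a unique root can be illustrated with a short Python sketch that uses the standard xml.etree.ElementTree module. The two Result fragments stand in for the output of a parser run on two source documents, and OutputWrapper is the example root element name used earlier in this guide. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-ins for the XML generated from two source documents.
fragments = ["<Result>1.0</Result>", "<Result>2.0</Result>"]


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Without a root element, the combined output has two top-level elements.
concatenated = "".join(fragments)

# With the added root element, the output has one top-level element.
wrapped = "<OutputWrapper>" + concatenated + "</OutputWrapper>"
```

Here is_well_formed(concatenated) is False, while is_well_formed(wrapped) is True.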
Event-Log Properties
In the project properties, you can configure the events that Data Transformation writes to the log. For more information, see Output Control Properties on page 222.
- The types of events that the Studio displays, such as notifications, warnings, or failures.
- Whether the failure events propagate (bubble up) in the events tree. Propagation lets you find the failure events more easily because they are labeled at the top levels of the tree.
The preferences are independent of the event-log properties. The properties control the events that the system stores in the log. The preferences control how the stored events are displayed.
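The idea of propagating failure events up the events tree can be sketched in Python as follows. This is only an illustration of the bubbling concept, not the Studio's implementation; the Event class, its fields, and the function name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A node in a simplified events tree (hypothetical model)."""
    name: str
    failed: bool = False
    children: list = field(default_factory=list)
    flagged: bool = False  # True if this node or any descendant failed


def propagate_failures(event: Event) -> bool:
    """Label each ancestor of a failed event so the failure is visible at the top."""
    # Visit every child before combining, so that no subtree is skipped.
    child_flags = [propagate_failures(child) for child in event.children]
    event.flagged = event.failed or any(child_flags)
    return event.flagged


marker = Event("Marker", failed=True)
content = Event("Content")
group = Event("Group", children=[marker, content])
root = Event("Execution", children=[group])
propagate_failures(root)
```

After the call, the root and the Group node are flagged because the Marker event below them failed, while the Content node is not flagged.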
To configure event preferences:
1. Click Window > Preferences.
2. Select the Data Transformation Events category.
3. Under the Filters heading, choose the events you want Data Transformation to display. The choices are:
4.
Alternatively, select individual events to be propagated and click Propagate Selected Events. You can select multiple events by pressing Control and clicking on each relevant row.
Figure 16-3. Event Display without Propagation
Warnings, failures, and optional failures may be perfectly normal under some circumstances. For example, a RepeatingGroup anchor may display an optional failure after its last iteration because it does not find more data to parse. If the event log displays warnings or failures, investigate why they occur, and determine whether they are normal or signal a problem.
Cross-Identifying Events
To help you trace the operation of a transformation and diagnose problems, you can identify the events and the components that caused them in the following ways:
- In the right pane of the Events view, double-click an event corresponding to an anchor, such as a Marker or Content event. The anchor that caused the event is highlighted in the IntelliScript and in the example source.
- In the example source, right-click an anchor and choose the following options:
  - View Instance. Finds the anchor definition in the IntelliScript.
  - View Event. Finds the corresponding event in the Events view.
- In the IntelliScript, right-click an anchor and choose View Marking. This finds the corresponding text in the example source.
Failure Handling
A failure is an event that prevents a component from processing data in the expected way. An anchor might fail if it searches for text that does not exist in the source document. A transformer or action might fail if its input is empty or has an inappropriate data type.
A failure can be a perfectly normal occurrence. For example, a source document might contain an optional date. A parser contains a Content anchor that processes the date, if it exists. If the date does not exist in a particular source document, the Content anchor fails.
By configuring the transformation appropriately, you can control the result of a failure. In the above example, you might configure the parser to ignore the missing data and continue processing.
The event log displays warnings about failures. In addition, you can configure a transformation to write a failure message in a user log.
Rollback
If a component fails, its effects are rolled back. For example, suppose that a Group contains three non-optional Content anchors, which store values in data holders. If the third Content anchor fails, the Group fails. Data Transformation rolls back the effects of the first two Content anchors. The data that the first two Content anchors already stored in data holders is removed.
The rollback applies only to the main effects of a transformation, such as a parser storing values in data holders or a serializer writing to its output. The rollback does not apply to side effects. In the above example, if the Group contains an ODBCAction that performs an INSERT query on a database, the record that the action added to the database is not deleted.
Group              //Failed
    Content        //Data holder is rolled back
    Content        //Data holder is rolled back
    ODBCAction     //INSERT query is not rolled back
    Content        //Failed
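The distinction between main effects and side effects can be modeled with a small Python sketch. It is an analogy under stated assumptions, not how Data Transformation is implemented: the data holders are a dictionary, the database is a list, and all function and class names are hypothetical.

```python
class ComponentFailure(Exception):
    """Raised by a hypothetical component that cannot process its input."""


def run_group(components, data_holders, database):
    """Run components in order; on failure, restore the data holders only."""
    snapshot = dict(data_holders)  # remember the main effects so far
    try:
        for component in components:
            component(data_holders, database)
    except ComponentFailure:
        data_holders.clear()
        data_holders.update(snapshot)  # main effects are rolled back
        return False                   # the database is left untouched
    return True


def content_first(holders, db):
    holders["first"] = "A"


def content_second(holders, db):
    holders["second"] = "B"


def odbc_action(holders, db):
    db.append("inserted record")  # side effect outside the data holders


def content_third(holders, db):
    raise ComponentFailure("text not found in source")


holders, database = {}, []
succeeded = run_group(
    [content_first, content_second, odbc_action, content_third],
    holders,
    database,
)
```

After the run, succeeded is False and holders is empty again, but database still contains the inserted record, mirroring the ODBCAction case above.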
- Edit the advanced properties of a component in the IntelliScript.
- Right-click the component and click Make Optional or Make Mandatory.
- Failure level: Information, Warning, or Error
- Name of the component that failed
- Failure description
- Location of the failed component in the IntelliScript
- Additional information about the transformation status, such as the values of data holders
- Parsers and anchors
- Serializers and serialization anchors
- Mappers and mapper anchors
LogError. Writes an error message containing the VarLastFailure system variable to the user log.
Same as LogError, but displays the message as a warning rather than an error.
Same as LogError, but displays the message as information rather than an error.
CustomLog. Runs a serializer that writes a custom message to the user log or another location. For more information, see CustomLog on page 144.
If the Marker does not exist in the source document, the system writes the following entry in the user log:
*** INFO *** : Marker, [MyParser[11].Marker], Can't find Marker<optional>('Height').
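If you post-process user logs in an application, an entry such as the one above can be split into its parts. The following Python sketch assumes that every entry follows the layout of this single example (level, component name, bracketed location, description). That layout is an assumption drawn from the example only, and the pattern and function names are hypothetical.

```python
import re

# Assumed layout: *** LEVEL *** : Component, [Location], Description
ENTRY_PATTERN = re.compile(
    r"^\*\*\* (?P<level>\w+) \*\*\* : "
    r"(?P<component>[^,]+), "
    r"\[(?P<location>.+?)\], "
    r"(?P<description>.*)$"
)


def parse_user_log_entry(line: str):
    """Return the fields of a user-log entry, or None if the layout differs."""
    match = ENTRY_PATTERN.match(line.strip())
    return match.groupdict() if match else None


entry = parse_user_log_entry(
    "*** INFO *** : Marker, [MyParser[11].Marker], "
    "Can't find Marker<optional>('Height')."
)
```

For the example entry, the sketch yields the level INFO, the component Marker, and the location MyParser[11].Marker.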
By default, each execution of a transformation generates a user log having a unique name:
<service_name>+<unique_string>.log
A transformation can set the user-log location at runtime by using SetValue actions to assign the following system variables. Set the phase property of SetValue to initial, ensuring that SetValue runs before any component that writes to the user log.
Variable: Description
- VarServiceInfo/StandardError/StandardErrorDir: Directory path of the user log
- VarServiceInfo/StandardError/StandardErrorName: File name of the user log
In the following example, a SetValue action sets the user-log directory to c:\mydirectory.
Disabling a Component
As you develop and test a transformation, you may wish to disable certain of its components temporarily. For example, if a particular anchor fails, you can disable the anchor and test the transformation without it. You can enable or disable a component by setting its disabled property, in either of the following ways:
- Edit the advanced properties of a component in the IntelliScript.
- Right-click the component and click Enable or Disable.
CHAPTER 17
Overview, 235
Preparing a Project for Deployment, 235
Data Transformation Repository, 236
Deploying a Service in a Development Environment, 237
Deploying a Service to a Production Server, 238
Running a Service, 238
Overview
When you finish configuring and testing a transformation, you can deploy it as a Data Transformation service. This lets Data Transformation Engine access and run the project. You can deploy a service both in the development environment where you use Data Transformation Studio and on production servers. Deploying in the development environment allows you to develop and test applications that activate the service. Deploying in the production environment allows your applications to run the services on live data.
Note: There is no relation between Data Transformation services and Windows services. You cannot view or
- Setting the startup component
- Removing any testing or debugging settings that might be inappropriate in a deployed service
- You may have inserted components such as WriteValue for debugging purposes. You can remove the debugging components.
- You may have used the sources_to_extract property of a parser to test multiple source files quickly. You can delete the property value.
- On the XML Generation page of the project properties, you may have selected the option to Add an XML Root Element to support multiple output documents from a single run. You can deselect the option.
- If you configured event logging on the Output Control tab, review the settings.
- You may have configured the initialization property of variables. If you plan to pass the initial values as service parameters from an application, you can delete the initialization properties, or you can leave them as defaults. For more information, see Initializing Variables at Runtime on page 66.
On the computer where you plan to deploy the service, open the Data Transformation Configuration Editor. View or edit the following setting:
CM Configuration/CM Repository/File system/Base Path
1. In Data Transformation Studio, open and select the project.
2. Click Project > Deploy.
3. In the Deploy Service window, set the following options:
- Service Name. The name of the service. By default, this is the project name. To ensure cross-platform compatibility, the name must contain only English letters (A-Z, a-z), numerals (0-9), spaces, and the following symbols:
  % & + - = @ _ { }
  Data Transformation creates a folder having the service name, in the repository location.
- Label. A version identifier. The default value is a time stamp indicating when the service was deployed.
- Startup Component. The runnable component that the service should start.
- Author. The person who developed the project.
- Description. A description of the service.
4. Click the Deploy button. The Studio displays a message that the service was successfully deployed. The service appears in the Repository view.
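Before you deploy, you might check a proposed service name against the character rule stated for the Service Name option. The following Python sketch encodes only that rule (English letters, numerals, spaces, and the listed symbols). The function name is hypothetical, and the check is not part of Data Transformation.

```python
import re

# English letters, numerals, spaces, and the symbols % & + - = @ _ { }
_ALLOWED_NAME = re.compile(r"[A-Za-z0-9 %&+\-=@_{}]+")


def is_portable_service_name(name: str) -> bool:
    """Return True if the name uses only the cross-platform character set."""
    return _ALLOWED_NAME.fullmatch(name) is not None
```

For example, is_portable_service_name("Invoice Parser_v2") is True, while a name containing a slash or an accented letter is rejected.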
Redeploying
Data Transformation Studio cannot open a deployed project that is located in the repository. If you need to edit the transformation, work on the original project and redeploy it.
To edit and redeploy a project:
1. Open the development copy of the project in Data Transformation Studio. Edit and test it as required.
2. Redeploy the service to the same location, under the same service name. You are prompted to overwrite the previously deployed version.
Redeploying overwrites the complete service folder, including any output files or other files that you have stored in it.
In Data Transformation Studio, display the Repository view. Right-click the service and click Remove. This removes only the copy in the repository. It has no effect on the development copy of the project in the Studio workspace.
1. Deploy the service on the development computer. For more information, see Deploying a Service in a Development Environment on page 237.
2. Copy the deployed project directory from the Data Transformation repository on the development computer to the repository on the remote computer. For more information about the repository locations, see Data Transformation Repository on page 236.
3. If you have added any custom components or files to the Data Transformation autoInclude\user directory, you must copy them to the autoInclude\user directory on the remote computer. For more information about custom components and the autoInclude directory, see the Data Transformation Engine Developer Guide.
4. Data Transformation Engine determines whether any services have been revised by periodically examining, by default every 30 seconds, the timestamp of a file called update.txt. This file exists in the repository root directory. The content of the file can be empty. If this is the first time that you have deployed a service to the remote repository, update.txt might not exist. If so, copy it from the local repository. If update.txt exists, update its timestamp as follows:
- On Windows: Open update.txt in Notepad and save it.
- On UNIX: Open a command prompt, change to the repository directory, and enter the following command: touch update.txt
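The copy and timestamp steps above could be scripted. The following Python sketch is one possible approach, assuming that both repositories are reachable as ordinary directories from the machine that runs the script. The function name and parameters are hypothetical. The sketch creates an empty update.txt if the file does not exist, which relies on the statement above that the file content can be empty.

```python
import shutil
from pathlib import Path


def copy_service_and_signal(dev_repo: Path, remote_repo: Path, service_name: str) -> None:
    """Copy a deployed service folder and refresh update.txt in the target repository."""
    source = dev_repo / service_name
    target = remote_repo / service_name
    # dirs_exist_ok merges into an existing folder (requires Python 3.8 or later).
    shutil.copytree(source, target, dirs_exist_ok=True)
    # touch() creates update.txt if needed, otherwise updates its timestamp.
    (remote_repo / "update.txt").touch()
```

Note that copytree with dirs_exist_ok merges files into an existing service folder instead of replacing the folder. If stale files must not survive, delete the target folder before copying.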
Alternatively, if the development computer can access the remote file system, you can change the Data Transformation repository to the remote location and deploy directly to the remote computer.
Running a Service
After you deploy a service, you are ready to run it in Data Transformation Engine. You can do this in several ways:
- By using the Data Transformation Engine command-line interface.
- By programming an application that uses the Data Transformation API to submit source documents to the Engine. The API is available in several programming languages.
- By using the Unstructured Data Transformation in Informatica PowerCenter.
- By using integration agents that run Data Transformation services within third-party systems.
For more information, see the Data Transformation Engine Developer Guide.
Index
A
AbsURL component 110 Accelerator specifying mapper APIs
reference 83 relation to delimiters 72 relation to XML 72 serialization 167, 172 using transformers 106 Engine 238 AppendListItems component 140 AppendValues component 141 applications services 235 architecture transformations 1 arithmetic computations 141 Asian languages UTF-8 encoding 220 assigning value to output 157 attributes data holders 55 AttributeSearch component 98 authentication project properties 218
B
Base64Decode component 112 Base64Encode component 112 BidiConvert component 112 BigEndianUniToUni component 112 BinaryFormat component 45 BIRT XmlToDocument report generator 34
C
CalculateValue component 141 CDATADecode
component 113 CDATAEncode component 113 ChangeCase component 113 Chinese UTF-8 encoding 220 CMW files 4 code pages supported 218 transforming 117 XSD schema 57 color coding Learn Example 225 Mark Example 225 use in debugging 225 combinations of lists 142 CombineValues component 142 COMClass component 161 CommaDelimited component 48 command-line interface Engine 238 complex segments streamer 208 ComplexSegment component 212 components overview 2 concatenation 141 strings 140 condition ensuring in source document 147 Connect component 103 Content 71 component 85 ContentSerializer component 172, 176 CreateGuid component 114 CreateList component 143 CreateUUID component 114 CustomFormat component 45 CustomLog component 144
D
data holders 55 destroying occurrences 68 identifying source and target 196 indexing multiple-occurrence 191 mixed content 61 overview 3
single or multiple occurrence 67 validating 59 database lookup transformer 124 querying 152 databases connecting to 134, 162 DateAdd component 145 DateAddICU component 145 DateDiff component 145 DateDiffICU component 145 DateFormat component 116 DateFormatICU component 114 dates format of 114 debugging transformations 225 default transformers in format 106 DelimitedSections component 86 DelimitedSectionsSerializer component 177 Delimiter component 51 DelimiterHierarchy component 48 delimiters custom hierarchy 45 relation to anchors 72 derived XSD types XSI type 62 direction property of anchors 76 DLLs using .NET as custom actions 149 DocList component 19 document processors 23 custom C++ 29 custom COM 28 custom Java 28 defining 23 installation 23 quick reference 25 reference 25 running multiple 32 documents overview 3 Dos96HebToAscii component 116 DownloadFile component 146 DownloadFileToDataHolder component 146 drag-and-drop defining anchors 75
E
EbcdicToAscii component 116 Eclipse Studio hosted in 7 EDI delimiters for parsing 49 elements data holders 55 EmbeddedMapper component 188 EmbeddedParser component 88 EmbeddedSerializer component 178 enclosed group 89 EnclosedGroup component 89 EnclosingDelimiters component 52 EncodeAsUrl component 116 Encoder component 117 encoding code page transformer 117 input and output 218 limitations on working 221 supported 218 XSD schema 57 Engine running services in 238 EnsureCondition component 147 errors failure handling 231 viewing 229 event log as debugging tool 228 configuring properties 228 custom events 140 Engine 230 viewing 228 events finding anchors 230 example source in project 4 example_source property mapper 187 serializer 175 examples installing and opening online 6 Excel generating from XML 36
parsing as HTML 26 parsing as text 27 parsing as XML 26, 27 ExcelToDataXml component 26 ExcelToHtml component 26 ExcelToTextML component 26 ExcelToTxt component 27 ExcelToXml component 27 ExcludeItems component 148 ExpandFrameSet component 27 ExternalCOMAction component 149 ExternalCOMPreProcessor component 27 ExternalJavaPreProcessor component 28 ExternalPreProcessor component 29 ExternalTransformer component 117 extracting content Content anchor 85
F
failure effect on parent 231 failure events 229 generated by RepeatingGroup 96 failure handling variables for 66 failures handing 231 viewing 229 fatal error event 229 files downloading 146 projects 4 FileSearch component 19 FindReplaceAnchor component 90 footer segment streamer 207 format preprocessors 52 FormatNumber component 118 forms processing PDF 30 submitting HTML 93, 158, 159 frameset parsing HTML 27 FromFloat
component 119 FromInteger component 119 FromPackDecimal component 119 FromSignedDecimal component 120
G
get HTTP method 159 Group component 91 group performing actions on 91 repeating 95 GroupMapping component 189 GroupSerializer component 179
events 229 ImageClick component 103 indexing 191 example 193 multiple-occurrence data holders 68 quick reference 201 information events 229 initialization variables 66 InjectFP component 121 InjectString component 121 InlineTable component 134 InputPort component 20 IntelliScript defining anchors in 75 iterations RepeatingGroup anchor 95
H
handling failures 231 header segment streamer 207 Hebrew code-page conversion 120 hebrewBidi component 120 HebrewDosToWindowsTransformer component 120 HebrewEBCDICOldCodeToWindows component 120 hebUniToAscii component 120 hebUtf8ToAscii component 120 HL7 component 49 HTML removing tags 127 submitting form 158, 159 transforming entities 120 HtmlEntitiesToASCII component 120 HtmlForm component 93 HtmlFormat component 46 HtmlProcessor 46 component 53, 121 HTTP Get and Post data 65
J
Japanese UTF-8 encoding 220 JavaScript syntax reference 147 JavaScriptFunction component 151 JavaTransformer component 122
K
Key component 202 key properties of 201
L
LearnByExample component 99 list types mapping to 81 XSD 68 lists combining 142 creating 143 multiple-occurrence data holders 67 of variables 68 sorting 157 LocalFile component 20 locations marking in source document 94 Locator
I
icons
component 204 LocatorByKey component 204 LocatorByOccurrence component 205 locators properties of 201 log options events 222 login project properties 218 logs event 228 LookupTransformer component 123 loop RepeatingGroup anchor 95
multiple-byte characters UTF-8 encoding 220 multiple-occurrence data holders combining 142 creating lists in 143 indexing 191 mapping anchors to 73 sorting 157
N
namespaces project properties 222 New Element window defining anchors in 75 NewlineSearch component 99 NormalizeClosingTags component 124 numbers formatting 118
M
Map component 152 Mapper component 187 mapper calling secondary 188 creating 183 input validation 63 mapper anchors properties of 186 reference 187 mappers deploying as service 236 properties of 186 quick reference 186 running 185 running in parser 154 using indexing 193 Marker 71 component 94 markers in streamers 210 MarkerStreamer component 213 marking property of anchors 76, 213 missing data failure handling 231 missing text searching by optional Group 92 mixed content data holders in 61 in XSD schema 57 mapping to 73 ModifyField component 103 MSMQOutput component 162 multiple occurrence data holders 67 destroying occurrences 68 variables 68
O
ODBC_Text_Connection component 134 ODBC_XML_Connection component 162 ODBCAction component 152 ODBCLookup component 124 offset dynamically defined 100 OffsetSearch component 100 online samples installing and opening 6 OpenURL component 162 optional failure effect on parent 231 event 229 optional property events 229 failure handling 231 setting 232 output viewing 227 OutputCOM component 163 OutputDataHolder component 164 OutputFile component 164 OutputPort component 20
P
packed decimals 130 numbers 119 parameters passing to transformation 66 Parser component 13 parsers calling secondary 88, 178 creating 9 deploying as service 236 running 12 running secondary 155 path resolving relative 110 pattern matching regular expressions 125 patterns segment opening and closing 208 PatternSearch component 100 PDF processing PDF forms 30 PDF conversion configuring 39 PDF files using PdfToTxt_4 processor 36 PDF output XmlToDocument postprocessor 34 PDF support converting PDF files 31 PdfFormToXml_1_00 component 30 PdfToTxt_2 component 31 PdfToTxt_3 component 31 PdfToTxt_4 component 31 using 36 phase of anchor search 77 phase property of anchors 76 phases nested 77 platform independence parsers 12 Positional component 49 post HTTP method 158 posted data retrieving 65 postprocessor XmlToDocument 34 PostScript component 50 PowerPoint parsing as HTML 31 PowerpointToHtml component 31
PowerpointToTextML component 31 predicate XPath 204 pre-processors defining 23 preprocessors document 23 format 52 ProcessByTransformers component 32 processing instructions adding to output 223 ProcessorPipeline component 32 processors custom C++ 29 custom COM 28 custom Java 28 document 23 installation 23 reference 25 using transformers as 106 project configuration overview 5 deploying 6 project properties 217 authentication 218 encoding 218 general information 218 namespaces 222 output control 222 setting 217 versus preferences 217 XML generation 223 projects architecture 3 deploying as service 235 properties of actions 138 of anchors 76 of mappers 186 of serializers 174 of transformers 107 project 217
Q
quick reference anchors 82 document processors 25 indexing 201 mappers 186 serializers 174 streamers 212
R
reference anchors 83 delimiters 47
document processors 25 format preprocessors 52 formats 45 indexing 202 mapper anchors 187 mappers 186 parsers 13 serialization anchors 175 reference points around anchors 76, 213 Marker anchor 94 of search scope 79 regex regular expressions 125 regular expressions syntax 125 RegularExpression component 125 reloading schema 59 RemoveField component 104 RemoveMarginSpace component 127 RemoveRtfFormatting component 127 RemoveTags component 127 repeating segment streamer 207 RepeatingGroup component 95 RepeatingGroupMapping component 189 RepeatingGroupSerializer component 180 Replace component 128 replacing text 131 in source document 90 report generator BIRT 34 repository services 236 requirements analysis transformation 5 ResetVisitedPages component 154 Resize component 128 ResultFile component 165 Results folder 4 results of transformation 227 results file debugging if not displayed 227 retrieving content Content anchor 85 ReverseTransformer component 129 right-to-left text
reversing 112, 120 rollback after failure 231 root element adding XML 224 RTF component 50 RtfFormat component 46 RtfProcessor component 53, 129 RtfToASCII component 129 RtfToTextML component 32 RunMapper component 154 runnable components deploying as service 236 RunParser component 155 RunSerializer component 156
S
samples installing and opening online 6 schema encoding 57 schemas adding XSD to project 58 creating 58 editing 58 reloading 59 sample XSD 56 validation 62 viewing 59 XSD 55 script files TGP 4 search anchor direction 76 dynamically defined search string 101 search criteria for anchors 77 search scope adjusting 79 for anchors 77 searcher components 80, 98 secondary mapper EmbeddedMapper anchor 188 secondary parser EmbeddedParser anchor 88, 178 SegmentIndex component 104 segments processing in streamer 207 SegmentSearch component 100 SegmentSize
component 104 select-and-click defining anchors 75 serialization mode 14, 169 using transformers in 107 serialization anchors 167, 172 defining 173 properties of 174 reference 175 sequence of operation 173 Serializer component 175 creating with wizard 170 serializer controlling auto-generation 168 input validation 63 serializers 167 creating from parser 168 deploying as service 236 properties of 174 quick reference 174 running 172 running in parser 156 troubleshooting auto-generated 170 service name variable storing 66 service parameters passing to transformation 66 ServiceLocation variable 65 services deploying 235 deploying in development environment 237 removing 237 repository 236 running in Engine 238 transformation types 3 updating 237 SetValue component 157 SGML component 50 signed decimals 130 numbers 120 SimpleSegment component 214 single occurrence data holders 67 Sort component 157 sorting multiple-occurrence data holders 157 source property 196 source documents testing in Studio 227 sources_to_extract property 14 SpaceDelimited component 50 splitting files 160
splitting large inputs streamer 207 startup components setting 226 streamer complex segments 208 component 214 creating 210 footer segment 207 header concatenation 209 header segment 207 output 209 repeating segment 207 running in API 212 segment opening and closing patterns 208 splitting large inputs 207 streamers quick reference 212 StreamerVariable component 215 strings concatenating 140, 141 StringSerializer 181 component 172 Studio instructions for use 7 overview 1 SubmitAll component 104 SubmitClick component 104 SubmitForm component 158 SubmitFormGet component 159 SubString component 129 system variables 64 system time variable 65
T
TabDelimited component 50 table configuration editor PdfToTxt_4 36 tables processing PDF 36 target property 196, 199 test documents in project 4 testing transformations 225 Text component 20 TextFormat component 46 TextML XML schema 36
TextSearch component 101 TGP files script 4 time system 65 times format of 114 ToFloat component 129 ToInteger component 130 ToPackDecimal component 130 ToSignDecimal component 130 TransformationStartTime component 131 TransformByParser component 131 TransformByProcessor component 132 TransformByService component 132 TransformerPipeline component 133 transformers 105 as document preprocessors 106 compared to actions 138 custom DLL 117 custom Java 122 default 106 defining 105 deploying as service 236 global stand-alone 107 in serialization 107 properties of 107 sequences of 106 using as document processors 32 using in anchors 106 troubleshooting transformations 225 types XSI 62 TypeSearch component 102
U
Unicode working encoding 220 UNIX designing parsers for 12 unknown event 229 URL component 21 relative to absolute 110 URLs specifying connections 65 user log variable defining location 66
V
validation data holders 59 ensuring for XML output 223 XML 62 XML input 63 XML parser output 63 VarCurrentPost variable 65 VarCurrentURL variable 65 VarFormAction variable 65 VarFormData variable 65 Variable component 67 variables data holders 55 in streamers 210 initialization 66 lists 68 mapping anchors to 66, 73 system 64 user-defined 64 using in actions 66 VarLastFailure variable 66 VarLinkURL variable 65 VarPostData variable 65 VarRequestedURL variable 65 VarServiceInfo variable 66 VarSystem system time 65
W
warning event 229 warnings viewing 229 WestEuroUniToAscii component 133 Word parsing as HTML 32 parsing as RTF 33 parsing as text 33 parsing as XML 33 WordPerfectToTextML component 32 WordToHtml component 32 WordToRtf component 33 WordToTextML component 33 WordToTxt
X
Xerces XML validation 63 XML adding empty tags 111 as parser input 9 generating sample 59 mapping anchors to 72 processing instruction 223 validation 63 XSD schemas 55 XSLT transformation 133 XML attributes data holders 55 XML elements data holders 55 XML generation project properties 223 XML Spy XSD editor 57 XML validation ensuring 223 XmlFormat component 47 XMLLookupTable component 134 XmlToDocument component 34 XmlToExcel component 36 XPath modified notation 61 predicate 204 XPaths validating 59 XSD adding schema to project 58 background 56 creating schemas 58 editing 58 editors 56, 57 included schemas 57 IntelliScript representation 61 sample schema 56 schema encoding 57 schemas 55 unsupported features 58 viewing 59 XSD data types searching for 80 XSD schemas in project 4 XSI types mapping data holders 62 XSLT running transformations 160 XSLTMap component 160 XSLTTransformer component 133
NOTICES
This Informatica product (the Software) includes certain drivers (the DataDirect Drivers) from DataDirect Technologies, an operating company of Progress Software Corporation (DataDirect) which are subject to the following terms and conditions:
1. THE DATADIRECT DRIVERS ARE PROVIDED AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.