Troubleshooting Guide
Version 9 Release 1
SC19-3804-00
Note: Before using this information and the product that it supports, read the information in "Notices and trademarks."
Copyright IBM Corporation 2008, 2012. US Government Users Restricted Rights Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Troubleshooting InfoSphere DataStage
  Troubleshooting problems when starting an InfoSphere DataStage and QualityStage client
    Failure to connect to services tier: invalid host name
    Failure to connect to services tier: invalid port
    IBM WebSphere Application Server fails to start: AIX and Linux
    Cannot authenticate user
  Troubleshooting scheduled jobs
    Resolving scheduling problems on Windows engine tier hosts
    Resolving scheduling problems on UNIX and Linux servers
  Resolving job termination problems
  Resolving problems with database stages on 64-bit systems
  Resolving ODBC connection problems on UNIX and Linux systems
    Testing ODBC driver connectivity
    Checking the shared library environment
    Checking symbolic links
  Resolving configuration problems on UNIX systems
    Running out of file units
    Running out of memory on AIX computers
  Troubleshooting Designer client errors
    Handling exceptions in the Designer client
    Viewing log files and error reports
  Troubleshooting a failure to submit jobs when you run a column analysis
  Troubleshooting login failures
    Client side login failures
    Server-side login failures
    Server-rich client login failures
  Troubleshooting job design issues
    IBM InfoSphere DataStage Error: Job xxx is being accessed by another user
    DataStage Parameter Set - Parameter Set locked by non-existent user
    Cannot get exclusive access to the log for a job
  Troubleshooting problems when creating InfoSphere DataStage projects
  Troubleshooting job failures
    Low system resource issues
    Disk space issues
    Disk lookup issues
    Data processing failures
    DataStage timeout variables
  Troubleshooting Specific Stages
    DB2 Connector Stage
    Join Stage
    Lookup Stage
    Sequential File Stage
    Teradata Connector Stage
    Sort Stage
    Transformer Stage
    DataStage Parallel framework changes that require DataStage job modifications
  Troubleshooting for specific operating systems
    Troubleshooting slow jobs that use data sets in cluster environments
    Heap allocation errors with DataStage Parallel Jobs on the AIX platform
  Tuning engine parameters
    Using tunable parameters in the UVCONFIG file
  Enabling tracing for DataStage parallel jobs
Symptoms
When you attempt to start one of the InfoSphere DataStage and QualityStage clients, the following message is displayed:
Failed to authenticate the current user against the selected Domain: Server [servername] not found.
Causes
You might be specifying an incorrect name for the computer that is hosting the services tier.
If the application server has started, the login screen is displayed; otherwise an error message is displayed. You can test whether you specified the correct name for the services tier host by attempting to ping the computer that is hosting the services tier.
Symptoms
When you attempt to start one of the InfoSphere DataStage and QualityStage clients, the following message is displayed:
Failed to authenticate the current user against the selected Domain: Could not connect to server [servername] on port [portnumber].
Causes
The port number is incorrect or is unavailable.
Test whether the port is accessible from the client computer by typing at the command line:
telnet hostname port
If you get an error message, then the port is inactive. If you get no response, then the port is active. You can also test which ports are listening on the server computer by typing the following command:
netstat -a
Look for an entry in the form: isserver:port_number
You can check whether you are specifying the correct port number in the WebSphere Administrative Console. To look up the port number:
1. From the start menu, select IBM WebSphere > Application Server v6 > Profiles > default > Administrative console to start the WebSphere Administrative Console.
2. Log in using the WebSphere user name and password that was specified when IBM InfoSphere Information Server was installed.
3. In the left pane, select Servers > Application servers.
4. Click the server1 link.
5. Select Communications > Ports.
6. Look for the port number for WC_defaulthost. This is the port number that you should use when connecting to the application server.
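For example, you can filter the netstat output for the port that you expect (9080 here is only a placeholder; substitute the WC_defaulthost port that you looked up):

netstat -a | grep 9080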
You can also check whether there is a firewall between the client and the server. If there is a firewall, temporarily disable it to verify that all inbound and outbound ports are open.
Symptoms
The application server fails to start after the system is restarted. No messages are generated in the application server logs.
Causes
The Metadata server startup script fails to finish. You must issue the nohup command for the Metadata server startup script.
Environment
IBM AIX or Linux systems.
This command might return multiple files with various prefixes in the name. Some files might be links to other files and could reflect the change you made in the original file without needing to edit each file that was found. If you have multiple instances of WebSphere Application Server installed, unique files might exist for each WebSphere Application Server instance. You only have to modify the files that reference the instances of WebSphere Application Server that you have configured to start as non-root.
2. Identify the files that you need to modify. Typically, you must modify the following files:
Operating system   Files
AIX                /etc/rc#.d/S99ISFServer
                   The number symbol (#) can have the value of 0 through 6. For example:
                   /etc/rc0.d/S99ISFServer
                   /etc/rc2.d/S99ISFServer
                   /etc/rc5.d/S99ISFServer
Linux              /etc/init.d/ISFServer
3. In each file, change the following content. Locate the following text, where IS_install_path is the directory where you installed InfoSphere Information Server. The default installation path is /opt/IBM/InformationServer:
"IS_install_path/ASBServer/bin/MetadataServer.sh" "$@"
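The remainder of this step is not shown here. Based on the cause described above (the Metadata server startup script must be run with the nohup command), the edited line would likely take a form such as the following sketch; verify it against your installed startup script before making the change:

nohup "IS_install_path/ASBServer/bin/MetadataServer.sh" "$@"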
Symptoms
When you attempt to start one of the InfoSphere DataStage and QualityStage clients, the following message is displayed:
Failed to authenticate the current user against the selected Domain: Invalid user name (username) or password.
Causes
There are several possible causes of this problem:
v The user name is invalid.
v The password is invalid or has expired.
v The user has no suite user role.
v Credential mapping is required, but has not been defined for this user.
v The user has no DataStage role or has the incorrect DataStage role.
InfoSphere DataStage does not have its own separate scheduling program. Instead, whenever an InfoSphere DataStage user schedules a job, the underlying operating system controls the job. If scheduled jobs do not run correctly, the problem is usually with the operating system configuration on the engine.
Symptoms
Scheduled jobs do not run when expected.
Environment
This advice applies to the Windows environment.
Symptoms
Scheduled jobs do not run when expected.
Causes
The user ID used to run the schedule service has invalid user name or password details.
Environment
This advice applies to the Windows environment.
4. Click the Schedule tab.
5. Enter the user name and password to test.
6. Click Test.
7. Wait for the user name and password to be verified (this might take some time).
Symptoms
Scheduled jobs do not run when expected.
Causes
The user running the schedule service does not have sufficient user rights.
Environment
This advice applies to the Windows environment.
Symptoms
Scheduled jobs do not run when expected.
Causes
The AT command, which performs the Windows scheduling, only accepts day names in the local language.
Environment
This advice applies to the Windows environment.
You might have to experiment with which day names the local AT command will accept. If in doubt, enter the full name (for example, LUNDI, MARDI, and so on).
4. Repeat these steps for each of your projects.
You might find that you get an error message equivalent to "There are no entries in the list" when you use the scheduler on a non-English language system. This message is output by the AT command and passed on by the Director client. To prevent the Director client from passing on the message:
1. Identify a unique part of the message that the AT command is outputting (for example, "est vide" in French).
2. For each project, add the following line to its DSParams file:
NO ENTRIES=est vide
The AT command usually accepts other keywords besides days of the week in English. If your system does not accept other keywords, you can add localized versions of the additional keywords NEXT, EVERY, and DELETE to your projects by doing the following tasks:
1. Edit the DSParams file for each project.
2. For each keyword, add a line of the form:
KEYWORD=localized_keyword
For example:
NEXT=Proxima
If your scheduled job did not run, there are a number of steps that you can take to identify the cause.
Symptoms
Administrator cannot see all the jobs that the users have scheduled.
Environment
This advice applies to the UNIX environment.
Symptoms
Scheduled job does not run when expected.
Environment
This advice applies to the UNIX environment.
Symptoms
Scheduled jobs do not run.
Environment
This advice applies to AIX servers.
Symptoms
Jobs take too long to terminate.
Causes
Each InfoSphere DataStage project directory contains a &PH& directory. The &PH& directory contains information about active stages that is used for diagnostic purposes. Entries are added to the &PH& directory every time a job is run, so the directory needs to be cleared periodically.
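One possible cleanup approach is to remove old files from the &PH& directory at the operating-system level. The following sketch assumes the default project path, a 30-day age threshold, and that no jobs are running while you clean up:

cd /opt/IBM/InformationServer/Server/Projects/MyProject
find './&PH&' -type f -mtime +30 -exec rm -f {} \;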
Symptoms
Failure of the stage with symptoms such as a memory fault and corresponding core dump.
Causes
If you are running a 64-bit version of InfoSphere DataStage, you must ensure that any database clients you use are also 64-bit. If you are running a 32-bit version of InfoSphere DataStage, you must ensure that any database clients you use are also 32-bit. For example, Oracle database is available with both 32-bit and 64-bit clients. You must use the 32-bit client with 32-bit InfoSphere DataStage, and the 64-bit client with 64-bit InfoSphere DataStage.
Environment
Applies to 64-bit UNIX, Linux, or Windows environments.
Symptoms
If a job fails to connect to a data source using an ODBC connection, test the connection outside the job to see if the ODBC connection is the source of the problem.
Environment
The procedure applies to ODBC connections in a UNIX environment.
Where DSN specifies the connection that you want to test.
6. Enter the user name and password to connect to the required data source.
7. After you have connected to the data source, enter .Q to close the connection.
Symptoms
Cannot connect to database using ODBC connection.
Environment
This problem occurs when using ODBC connections in a UNIX environment.
Check that the ODBC driver shared library has been added to the environment variable that is used to locate shared libraries.
Table 2. Library path environment variables
Platform         Environment variable
Solaris          LD_LIBRARY_PATH
HP-UX            SHLIB_PATH
HP-UX Itanium    LD_LIBRARY_PATH
AIX              LIBPATH
Linux            LD_LIBRARY_PATH
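For example, a sketch of appending an ODBC driver library directory to the library path in the $DSHOME/dsenv file (Linux shown; use the variable from Table 2 for your platform, and note that the /opt/odbc/lib path is an assumption - substitute your driver directory):

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/odbc/lib
export LD_LIBRARY_PATH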
Symptoms
Cannot connect to database using ODBC connection.
Causes
If you have moved shared libraries to a new directory or have installed a new ODBC driver manager, you might have broken symbolic links that the engine uses to access the shared libraries for the source database.
Environment
This problem occurs when using ODBC connections in a UNIX environment.
$DSHOME is the home directory of the server engine. pathname is the full path name of the directory that contains the shared libraries.
To reset links for a new ODBC driver manager:
1. Install the ODBC driver manager according to the vendor's instructions.
2. Determine where the ODBC shared library libodbc.xx resides. For example, the library for the Intersolv driver is in $ODBCHOME/dlls, and the library for the Visigenics driver is in $ODBCHOME/libs.
3. Close all InfoSphere DataStage clients.
4. Run the relink.uvlibs command as described above.
5. Restart the InfoSphere DataStage clients.
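If you are not sure where libodbc.xx resides, a quick search such as the following sketch can help ($ODBCHOME is assumed to point at your driver manager installation directory):

find $ODBCHOME -name "libodbc*" -print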
Symptoms
Jobs fail because they run out of file units.
Environment
This advice applies to UNIX systems.
Ensure that you allow at least thirty seconds between executing stop and start commands.
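For example, a minimal sketch of restarting the engine with the recommended pause, assuming $DSHOME is set and the dsenv file has been sourced:

$DSHOME/bin/uv -admin -stop
sleep 30
$DSHOME/bin/uv -admin -start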
Symptoms
Jobs with large memory requirements cause "unable to locate memory" errors.
Environment
This advice applies to AIX systems.
4. Add the following line to the dsenv file (in the $DSHOME directory):
LDR_CNTRL=MAXDATA=0x30000000;export LDR_CNTRL
5. Source the dsenv file (. ./dsenv) to apply the new environment settings.
6. Restart the engine:
$DSHOME/bin/uv -admin -start
You can do the following actions on the Automatic error report message:
v Click ds_errorreport_YYMMDDHHmm.zip to view the directory containing the error reports using the Windows File Explorer.
v Click customized to open the Customize Report window where you can add a description of the scenario that caused the problem.
v Click More to display details of the exception and the client machine.
The ds_errorreport_YYMMDDHHmm.zip file contains the following information:
v the original error message
v the stack trace and exception details
v the client machine details
v the Client Version.xml file
v the associated dstage_wrapper_trace_NN.log file
v an optional user-defined description, entered on the Customize Report window
You can do the following actions on the Optional error report message:
v Click here to create an error report for the exception. The Customize Report window opens, where you can add a description of the scenario that caused the problem.
v Click More to display details of the exception and the client machine.
Stack Trace
The error indicates that your client cannot connect to the Information Server services tier (domain) server. There are many possible causes for this problem. It can be as simple as an invalid server name or port number. Click the More button to get a stack trace for the error.
javax.security.auth.login.LoginException: Could not connect to server [RMANIKON-2] on port [9081]. at com.ascential.acs.security.auth.client.AuthenticationService.getLoginException (AuthenticationService.java:991) at com.ascential.acs.security.auth.client.AuthenticationService.doLogin (AuthenticationService.java:370) Caused by: com.ascential.acs.registration.client.RegistrationContextManagerException: Caught an unexpected exception. at com.ascential.acs.registration.client.RegistrationContextManager.setContext (RegistrationContextManager.java:76) at com.ascential.acs.security.auth.client.AuthenticationService.doLogin (AuthenticationService.java:364) Caused by: com.ascential.acs.registration.client.RegistrationHelperException:
Caught an unexpected exception. at com.ascential.acs.registration.client.RegistrationHelper.getBindingProperties (RegistrationHelper.java:672) at com.ascential.acs.registration.client.RegistrationHelper.getBindingConfigProperties (RegistrationHelper.java:566) at com.ascential.acs.registration.client.RegistrationContextManager.setContext (RegistrationContextManager.java:173) at com.ascential.acs.registration.client.RegistrationContextManager.setContext (RegistrationContextManager.java:73) ... 1 more Caused by: java.net.ConnectException: Connection refused: connect at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:391) at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:252) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:239)
There are four important things to note in the stack trace. First, there is no text that states "Trace from Server", which means that this is a client-side issue. Second, the "Could not connect to server" message gives the host name and port number. Third, the stack trace shows that the error happens during the RegistrationHelper call. Last, the "Connection refused" message at the end indicates that the root cause is a socket connection error.
One issue might be that WebSphere Application Server is not running. On Windows, go to Services in the Control Panel and check that the service IBM WebSphere Application Server has a status of Started. On UNIX and Linux, use the command ps -ef | grep java and check that the WebSphere process is running. For example:
ps -ef | grep java
root 25468 1 0 May 02 ? 33:33 /u1/IBM/WebSphere/AppServer/java/bin/java ...
Another issue might be that the port is blocked by a firewall. You can do a quick test by trying to telnet to the host and port number. Use the command telnet <DataStage host> <port number>. If the telnet fails, then the port is most likely blocked. If you are on Linux, you might also use the nc command to see if the port is open. If the port is blocked, your administrator must open the port.
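For example, a quick sketch of the nc test mentioned above (the host name and port number are placeholders):

nc -zv dshost.example.com 9080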
This is a unique symptom, and the root cause is that WebSphere Application Server is not at the correct Java level. To resolve this issue, the WebSphere Java SDK must be upgraded to Java SDK 1.4.2 SR10.
This is caused by a known ORB defect that can be resolved by upgrading WebSphere Application Server with iFix PK76826.
In the last scenario, there is an example stack trace from the client with an error that is caused by a bad hosts file on the server. The first message indicates that the error is raised at login time by the authentication service, and the last message indicates that this happened during the server lookup.
Your /etc/hosts entry might use a single-line format, which can cause problems. The single-line format is similar to the following entry:
127.0.0.1 localhost.localdomain localhost machine_long_hostname machine_short_hostname
To resolve the problem, separate the /etc/hosts entry to use the following double-line format:
127.0.0.1 localhost.localdomain localhost
<real ip address> machine_long_hostname machine_short_hostname
Causes
v Client has an invalid entry in its host file
v Server listening port might be blocked by a firewall
v Server is down
v Update the host file on the client system so that the server host name can be resolved from the client.
v Make sure the WebSphere TCP/IP ports are opened by the firewall.
v Make sure that the WebSphere application server is running.
Failed to authenticate the current user against the selected Domain: CORBA MARSHAL 0x4942f89a No; nested exception is: org.omg.CORBA.MARSHAL: Trace from server: 1198777258 at host PURPLE1 >> org.omg.CORBA.MARSHAL: Unable to read value from underlying bridge : initial and forwarded IOR inaccessible:Forwarded IOR failed with: java.net.SocketException: Operation timed out: connect:could be due to invalid address:host=10.38.86.83, port=3953Initial IOR failed with: java.net.SocketException: Operation timed out: connect:could be due to invalid address:host=10.38.86.83,port=3953 vmcid: IBM minor code: 89A completed: No at com.ibm.rmi.iiop.CDRInputStream.read_value(CDRInputStream.java:1993) at com.ascential.acs.security.auth.server. _EJSRemoteStatelessAuthenticationService_e0d03809_Tie. login(_EJSRemoteStatelessAuthenticationService_e0d03809_Tie.java:146) at com.ascential.acs.security.auth.server. _EJSRemoteStatelessAuthenticationService_e0d03809_Tie. _invoke(_EJSRemoteStatelessAuthenticationService_e0d03809_Tie.java:92) at com.ibm.CORBA.iiop.ServerDelegate.dispatchInvokeHandler(ServerDelegate.java:614) at com.ibm.CORBA.iiop.ServerDelegate.dispatch(ServerDelegate.java:467) at com.ibm.rmi.iiop.ORB.process(ORB.java:439) at com.ibm.CORBA.iiop.ORB.process(ORB.java:1761) at com.ibm.rmi.iiop.Connection.respondTo(Connection.java:2376) at com.ibm.rmi.iiop.Connection.doWork(Connection.java:2221) at com.ibm.rmi.iiop.WorkUnitImpl.doWork(WorkUnitImpl.java:65) at com.ibm.ejs.oa.pool.PooledThread.run(ThreadPool.java:118) at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1475) << END server: 1198777258 at host PURPLE1 vmcid: IBM minor code: 89A completed: No javax.security.auth.login.LoginException: CORBA MARSHAL 0x4942f89a No; nested exception is: org.omg.CORBA.MARSHAL:
Causes
v The client IP address is listed in the stack trace, and is not reachable from the server
v The client port is blocked
IBM minor code: 89A completed: No at com.ibm.rmi.iiop.CDRInputStream.read_value(CDRInputStream.java:1993) at com.ascential.xmeta.shared.repository.core. _EJSRemoteStatefulSandboxRemoteStatefulService_ 4baa4bb1_Tie.executeQuery__CORBA_WStringValue __CORBA_WStringValue__com_ascential_ xmeta_crud_InternalQueryOptions__com_ascential _xmeta_crud_InternalQueryCompileOptions__ java_util_Map(Unknown Source) at com.ascential.xmeta.shared.repository.core. _EJSRemoteStatefulSandboxRemoteStatefulService_ 4baa4bb1_Tie._invoke(Unknown Source) at com.ibm.CORBA.iiop.ServerDelegate.dispatchInvokeHandler (ServerDelegate.java:614) at com.ibm.CORBA.iiop.ServerDelegate.dispatch (ServerDelegate.java:467) at com.ibm.rmi.iiop.ORB.process(ORB.java:439) at com.ibm.CORBA.iiop.ORB.process(ORB.java:1761) at com.ibm.rmi.iiop.Connection.respondTo(Connection.java:2376) at com.ibm.rmi.iiop.Connection.doWork(Connection.java:2221) at com.ibm.rmi.iiop.WorkUnitImpl.doWork(WorkUnitImpl.java:65) at com.ibm.ejs.oa.pool.PooledThread.run(ThreadPool.java:118) at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1475) <<END>>
Causes
The WebSphere Application Server SDK is outdated
Server callback failure with a WebSphere Application Server SDK that is outdated
Symptoms
The stack trace includes the following message:
Read beyond end of data. No fragments available
Trace from server: 1198777258 at host green.bocaraton.ibm.com >> org.omg.CORBA.MARSHAL: Unable to read value from underlying bridge : No available data: Request 18:read beyond end of data. No fragments available. vmcid: IBM minor code: 89A completed: No at com.ibm.rmi.iiop.CDRInputStream.read_value(CDRInputStream.java:1993) at com.ascential.acs.security.auth.server. _EJSRemoteStatelessAuthenticationService_e0d03809_ Tie.login(_EJSRemoteStatelessAuthenticationService_e0d03809_Tie.java:146) at com.ascential.acs.security.auth.server. _EJSRemoteStatelessAuthenticationService_e0d03809_ Tie._invoke(_EJSRemoteStatelessAuthenticationService_e0d03809_Tie.java:92) at com.ibm.CORBA.iiop.ServerDelegate.dispatchInvokeHandler(ServerDelegate.java:614) at com.ibm.CORBA.iiop.ServerDelegate.dispatch(ServerDelegate.java:467) at com.ibm.rmi.iiop.ORB.process(ORB.java:439) at com.ibm.CORBA.iiop.ORB.process(ORB.java:1761) at com.ibm.rmi.iiop.Connection.respondTo(Connection.java:2376) at com.ibm.rmi.iiop.Connection.doWork(Connection.java:2221) at com.ibm.rmi.iiop.WorkUnitImpl.doWork(WorkUnitImpl.java:65) at com.ibm.ejs.oa.pool.PooledThread.run(ThreadPool.java:118) at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1475) << END server: 1198777258 at host green.bocaraton.ibm.com
Causes
The ORB data is fragmented, which is a known issue.
Causes
v Some Linux computers automatically configure the host file with the following entry:
  127.0.0.1 localhost.localdomain localhost machine_long_hostname machine_short_hostname
v The server has more than one IP address, and one IP address is not accessible from the client
Ensure that the host name of each endpoint in WebSphere Application Server resolves to a client-accessible IP address. The WebSphere Application Server endpoint configuration can be found in the WebSphere administrative console: Servers -> Application servers -> server1 -> Ports. The server name specified on the endpoint must resolve to a client-accessible IP address. The IP addresses 127.0.0.1 and 192.168.x.x are normally not accessible from client systems.
Troubleshooting job design issues
IBM InfoSphere DataStage Error: Job xxx is being accessed by another user
A job can be accessed only by one user at a time.
Symptoms
You are unable to view a job, and receive the following error message.
Error: Job xxx is being accessed by another user
Causes
The job that you are trying to view is currently being accessed by another user.
l. Check Enable job administration in Director
m. Click OK
n. Click Close
o. Exit DataStage Director and relaunch
p. Repeat steps C through I.
q. Log in to the server as the dsadm user
r. cd to the DSEngine directory
s. Enter . ./dsenv to source the dsenv file
t. Enter ./bin/uvsh to get into the DataStage prompt
u. At the ">" DataStage engine prompt, enter LOGTO project_name
v. Run LIST.READU EVERY to list all the locks
w. Check active record locks under the "Item Id" column for the job name or RT_CONFIG# or RT_LOG# (# matches the job description number)
x. Write down the Inode numbers and user numbers associated with these locks
y. Enter LOGTO UV. If the LOGTO command is disabled, enter the following command:
CHDIR path_to_the_DSEngine_folder
The UNLOCK command lives in the UV account.
z. Enter UNLOCK INODE inode# USER user# ALL
aa. You can use Q to get out of the DataStage engine
3. Use the cleanup_abandoned_locks utility to clear any abandoned locks. The cleanup_abandoned_locks utility deletes session locks from the Information Server repository that were left over from some usage of an Information Server suite application such as DataStage. Log in to the domain layer as either the root or Administrator user.
cd /opt/IBM/InformationServer/ASBServer/bin
./cleanup_abandoned_locks.sh (on UNIX or Linux)
./cleanup_abandoned_locks.bat (on Windows)

usage: cleanup_abandoned_locks
 -P,--password   Password
 -U,--user       User name
 -h,--help       Print this message.
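For example, a sketch of running the utility on UNIX or Linux (the user name shown is only a placeholder for a suite administrator account):

cd /opt/IBM/InformationServer/ASBServer/bin
./cleanup_abandoned_locks.sh -U isadmin -P mypassword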
Causes
The user was no longer connected to DataStage. No session found in the Web Console for the user.
Symptoms
The user deletes a job from the DataStage Designer, and receives the following error:
Unable to delete the item(s). Delete object for \<path>\<jobname> failed. Cannot get exclusive access to log for job <jobname>
Causes
A lock remains on the RT_LOG file for the job.
Causes
v Incorrectly configured repository database
v Leftover metadata in repository database from a previously failed project creation
v Unable to create log file on the DataStage Server
v Incorrectly configured locale on the DataStage Server
v Failed to load JVM into the DataStage Server process (dsapi_slave)
v Firewall configuration
v Trusted authentication between DataStage Server system and the Domain system failed
v DataStage was not installed on the Domain system
v Locale or regional settings customized on the Client system
v Disk or partition full or user quota reached on DataStage Server system
v Project creation fails at "Initializing demo files..." within the Administrator client
v Stack Execution Disable (SED) is enabled (AIX only)
v Unable to increase the table space for the metadata repository (XMETA) v Error updating secondary indexes
These commands attach to the running dsrpcd process and record all of the system calls that are made by that process and its children during subsequent client-server sessions. For example, a call to create a project from the Administrator client or the dsadmin command line is recorded. To produce extra diagnostic information for the JVM initialization after all of its libraries are successfully loaded you can enable JVM startup tracing. Add the following lines to /opt/IBM/InformationServer/Server/DSEngine/dsenv:
On Windows these tracing options can be set as System Environment Variables by using the System Control Panel. Remember to restart the DataStage Server engine processes after adding these variables, and to remove these environment variables after they are no longer needed.
Enable repository database tracing. To enable tracing of the code that populates the repository database, follow these steps:
1. Create a file on the DataStage Server system in /opt/IBM/InformationServer/ASBNode/conf/ called NewRepos.debug.properties. The file name is case sensitive.
2. In the file, add the following three lines:
log4j.logger.com.ascential.dstage=DEBUG
log4j.logger.com.ibm.datastage=DEBUG
NewRepos.spy.trace=true
The dstage_wrapper_trace_N.log file will then contain extra tracing information the next time a project creation is attempted. Ensure that you delete the NewRepos.debug.properties file when finished. In addition, spy trace files, such as dstage_wrapper_spy_N.log, are produced in the same directory as the log files. These files contain a detailed record of low-level method calls and can grow large.
Running project creation manually. The project creation code runs in the context of a dsapi_slave process, which does not have any console output. Locate the full "RUN BP DSR_QUICKADD.B" command line from the domain installation log files in /opt/IBM/InformationServer/logs/. Use the following commands to run the project creation code so that you can view the console output:
Linux and UNIX
1. cd /opt/IBM/InformationServer/Server/DSEngine
2. . ./dsenv
3. bin/uvsh
4. RUN BP DSR_QUICKADD.B <arguments from log file>
5. QUIT
Windows
1. cd /opt/IBM/InformationServer/Server/DSEngine
2. bin/uvsh
3. RUN BP DSR_QUICKADD.B <arguments from log file> <newProjectName> C:\IBM\InformationServer\Server\Projects\<newProjectName> CREATE
4. QUIT
8.1 and later message: "DSR.ADMIN: Error creating DR elements, Error was Unique constraint violation."
These types of errors usually occur because the repository database returns an error when attempting to make an update. The dstage_wrapper_trace_N.log file may contain more specific details about the exact database error. There might also be a database log, depending on what type of database the repository is running in, which contains more information. For example, DB2 has the db2diag tool which can be run to find out the exact reason why an update failed. Typical failures are out of disk space, memory configuration problems, and so on.
For repository database errors it is important to confirm that the database was created with the scripts supplied on the installation media. These scripts configure important database parameters which, if missed, might cause project creation problems. It is also important that the database was created with the correct character set, as per the database creation script documentation on the installation media (typically UTF16/32). If a different character set was used, some of the metadata stored can become corrupted or might cause unexpected primary key violations. If the wrong character set was used, the product needs to be reinstalled.
For errors at this level the WebSphere Application Server logs might contain additional information. The files SystemOut.log and SystemErr.log can be found in the following directory: .../WebSphere/AppServer/profiles/_profile_name_/logs/server1/
Leftover metadata in repository database from a previously failed project creation
8.0.x message: "Error creating DR elements, Error was -1"
This problem occurs only on 8.0.x systems and can be identified by looking in the dstage_wrapper_trace_N.log file for a "unique constraint violation" error. This can occur when a project creation failed and did not remove all of its metadata from the repository. Even though the project cannot be seen in DataStage, attempting to create a project of the same name results in this error. To work around this problem, you can simply create a project with a different name. Alternatively, IBM support can provide a tool and instructions for how to remove the leftover data from the repository.
Unable to create log file on the DataStage Server
8.0.x message: "Error creating DR elements, Error was -1"
8.1 and later message: "DSR.ADMIN: Error creating DR elements, Error was log4j:ERROR setFile(null,true) call failed."
Just before the metadata repository is populated with the default project contents, a log file is created on the DataStage Server system in /home/_Credential_Mapped_Username_/ds_logs/. If this log file cannot be created, the project creation will fail. On Windows computers, the user home directory is usually C:\Documents and Settings\_Credential_Mapped_UserName_
The usual reasons why this log was not created are either that the user has no home directory at all or that they do not have appropriate permissions on it.
Incorrectly configured locale on the DataStage Server
8.0.x message: "Error creating DR elements, Error was -1"
8.1 and later message: "DSR.ADMIN: Error creating DR elements, Error was Unmatched quotation marks"
This problem is ultimately caused by bad locale configuration on the DataStage Server system. This manifests itself because the "hostname" command is run during project creation, and instead of returning the correct host name it returns a string such as "couldn't set locale correctly".
Failed to load JVM into the DataStage Server process (dsapi_slave)
8.0.x message: "(The connection is broken (81002))"
The JVM (Java Virtual Machine) can fail to load for several reasons. If it does fail to load, the dsapi_slave process is terminated, which results in broken connection errors on the client such as error 81002. A core file might be produced which can be used to determine what caused the process to be terminated. Possible causes of this problem are:
v The LIBPATH (or equivalent) is too long and caused a buffer overflow. This can be confirmed by using the Administrator client to run the "env" command with the Command button. If the contents of LIBPATH are duplicated, then it is probable that dsenv is sourced twice. The dsenv file does not need to be sourced when starting the DataStage Server engine processes with the uv -admin -start command.
v Incompatible or missing patches on the Client, Server, and Domain systems. By looking in the version.xml file of each system you can confirm what patches are installed. Ensure that patches are installed on all appropriate systems.
v Environment variables such as LDR_CNTRL were added or modified in the IBM/InformationServer/Server/DSEngine/dsenv file. Generally speaking, LDR_CNTRL settings in dsenv must not be modified unless otherwise directed by IBM.
v Incompatible operating system kernel parameters.
Firewall configuration
8.0.x message: "Error creating DR elements, Error was -1"
8.1 and later message: "DSR.ADMIN: Error creating DR elements, Error was com.ascential.xmeta.exception.ServiceException"
The DataStage Server system needs to communicate with the domain system, which means that certain ports need to be open between these systems if they are on separate machines. This problem can be confirmed by looking in the dstage_wrapper_N.log file for the following error: Connection refused:host=<hostname>,port=2809. Ensure that the firewall is correctly configured and use telnet <hostname> <port> from the DataStage Server machine to confirm that the port is accessible. The necessary firewall configuration can be found in the installation guide.
Trusted authentication between DataStage Server system and the Domain system failed
8.0.x message: "Error creating DR elements, Error was -1"
8.1 and later message: "DSR.ADMIN: Error creating DR elements, Error was Mapping failed to copy attributes: MetaTable -> DSTableDefinition (EObject: null, MetaTable)"
The DataStage Server system authenticates with the Domain system by a process called trusted authentication. This process uses a secure certificate exchange rather than explicit user name and password authentication. If the process fails, the project creation fails. Trusted authentication failure is identified by multiple exceptions in the DataStage Server ds_logs that say "Null session". This can fail for a number of reasons:
v If the DataStage Server is installed onto a Windows system (say C:\IBM\InformationServer), installing the clients into a different directory (say C:\IBM\InformationServer2) causes the certificate exchange to fail, ultimately causing the project creation to fail. See technote #1409412 and APAR JR34441 for more information.
v The number of trusted sessions reaches a maximum limit, so a new session cannot be started. This is identified by an entry in the WebSphere logs that says the limit is exceeded. If so, restarting WebSphere Application Server clears everything so that new sessions can be created and project creation can succeed.
DataStage was not installed on the Domain system
8.0.x message: "Error creating DR elements, Error was -1"
8.1 and later message: "DSR.ADMIN: Error creating DR elements, Error was Mapping failed to copy attributes: MetaTable -> DSTableDefinition (EObject: null, MetaTable)"
When installing the Domain and DataStage Server onto different physical systems, the installation of DataStage Server fails to create the projects specified in the installer if DataStage is not installed onto the Domain. These errors can be found in the installation logs. Furthermore, attempting to create projects by using the Administrator client or command line fails. In both these cases, the exceptions state that The package with URI "http:///1.1/DataStageX.ecore" is not registered. DataStage can be added to the Domain system by rerunning the installer, selecting DataStage and clearing the other components.
Locale or regional settings customized on the Client system
8.0.x message: "Error creating DR elements, Error was -1"
8.1 and later message: "Invalid Node Name %1"
If the regional language settings are modified to use a customized short date format (for example "ddd dd/MM/yyyy"), it can cause the DataStage Administrator client to send the wrong date information to the DataStage Server, causing project creation to fail. A patch for this issue is available under APAR JR34770.
Disk or partition full or user quota reached on DataStage Server system
8.0.x message: "Error creating DR elements, Error was -1"
8.1 and later message: "DSR.ADMIN: Error creating DR elements, error was log4j: ERROR failed to flush writer."
The project creation operation creates a log file on the DataStage Server system, called dstage_wrapper_trace_N.log, in the path indicated at the beginning of this document. The log creation fails when the disk or partition is full or the user to which credentials are mapped reaches their disk quota. Free up space as necessary and try the operation again.
Project creation fails at "Initializing demo files..." within the Administrator client
8.5 message: "Errors were detected during project creation that might render project <name> unstable. Caused by: DSR.ADMIN: Error creating DR elements, Error was <date timestamp> java.utils.prefs.FileSystemPreferences$2 run."
This error states that there was a problem with being able to write Java preference data. One of the following items causes these problems:
v SE (Security Enhanced) Linux is enabled.
v The user ID that is trying to create the project does not have a local home directory to write to.
If SELinux is enabled, disable it. To determine if SELinux is installed and in enforcing mode, you can do one of the following actions:
v Check the /etc/sysconfig/selinux file
v Run the sestatus command
v Check the /var/log/messages file for SELinux notices (Notice format might differ between RHEL 4 and RHEL 5.)
To disable SELinux, you can do one of the following actions:
v Set it in permissive mode and run the setenforce 0 command as a superuser
v Modify /etc/sysconfig/selinux and reboot the machine
If there is no home directory for the user ID, create a local home directory with write permissions (766) and have the group as part of the local dstage group.
Stack Execution Disable (SED) is enabled (AIX only)
8.0.x message: "Error creating DR elements, Error was -1"
8.1 and later message: "Unable to confirm the JVM can be loaded into the DataStage server process 'DSR_CREATE.PROJECT.B TestJVM' failed"
If Stack Execution Disable (SED) is enabled in AIX, the JIT compiler fails when trying to run code it generated in the process data area. This occurs with all of the DataStage executable items that have embedded JVMs. The solution to this problem is to turn off SED at the system level and reboot the machine. To turn off SED, use the command: sedmgr -m off
Unable to increase the table space for the metadata repository (XMETA)
8.0.x message: "Error creating DR elements, Error was -1"
8.1 and later message: "DSR.ADMIN: Error creating DR elements, Error was unable to save"
If DB2 is used for the metadata repository (XMETA), look in <db2instance_home>/sqllib/db2dump/db2diag.log for errors. To resolve this problem, increase the table space, and try the operation again. It might be necessary to manually delete any partially created project, which can be done by following the material here: http://www-01.ibm.com/support/docview.wss?uid=swg27021312
Error updating secondary indexes
Error message: "DSR.ADMIN: Error updating secondary indexes. Status code = -135 DSJE_ADDPROJECTFAILED"
A known cause for the error updating secondary indexes is one or more missing I_* directories in the /opt/IBM/InformationServer/Server/Template directory. If there is another DataStage engine installation (of the same version and patch level) available, it is possible to copy the Template directory from the working engine and use it to replace the Template directory on the broken engine. However, be careful to back up the existing Template directory first. If a Template directory is taken from a working engine of a different patch level, some of the patches on the broken engine might be rendered ineffective.
Symptoms
You receive the following message
DataStage parallel job fails with fork() failed, Resource temporarily unavailable
Causes
This error occurs when the operating system is unable to create all of the processes that are needed for the job at run time. Unfortunately, the exact reason for the failure is not available. This problem occurs on UNIX and Linux platforms for the following reasons:
v The maximum process limit is reached
v The kernel or the maximum open file limit is reached
v The swap space allocation or pre-allocation is exceeded
environment and you need to scale back the job run time. This might reduce performance, but allow the job to complete. To reduce the number of processes, use the following methods:
v Reduce the number of logical nodes in the APT_CONFIG_FILE
v Ensure that APT_DISABLE_COMBINATION is not set
The following command can be used on the AIX platform to see the current value of maxuproc:
lsattr -E -l sys0 | grep maxuproc
A reasonable setting for environments that are running large jobs would be maxuproc = 1000. To optimize this value, you can monitor the number of processes that are running daily over time, and then set the value appropriately. Here is some sample shell script code that you can use to monitor the number of processes that belong to the dsadm user. The script loops 360 times and takes a measurement every 5 seconds.
#!/bin/sh
# Sample script: count the processes that belong to the dsadm user.
# Takes a measurement every 5 seconds, 360 times (about 30 minutes).
COUNTER=360
rm -f dsadm_count.txt
until [ $COUNTER -lt 0 ]; do
    let COUNTER-=1
    sleep 5
    date >> dsadm_count.txt
    # Exclude the grep process itself from the count.
    ps -ef | grep dsadm | grep -v grep | wc -l >> dsadm_count.txt
done
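If you decide to raise maxuproc on AIX, a sketch of the command is shown below (run as root; the value 1000 follows the guidance above):

chdev -l sys0 -a maxuproc=1000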
There are special considerations for the Windows platform. For tuning Windows environments for large jobs, read the related technote, Tuning Windows Environments.
Symptoms
The DataStage job aborts with the following message:
main_program: The section leader on xxx died.
Causes
This "section leader xxx died" error is related to temporary resource unavailability. The conductor process times out because it did not receive an acknowledgment from the player process that it started successfully.
Set the APT_PM_NODE_TIMEOUT environment variable to 300. This might resolve the error message, but might not resolve the availability of resources on the system. See this note on the environment variable from the Parallel Job Advanced Developers Guide:
APT_PM_NODE_TIMEOUT: The need for long timeouts in the job startup process shows that the engine tier hardware is approaching overload. It is better to run fewer concurrent jobs in order to keep startup times low.
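For example, a sketch of setting the variable for all projects by adding it to the $DSHOME/dsenv file (restart the engine afterward so that the change takes effect):

APT_PM_NODE_TIMEOUT=300
export APT_PM_NODE_TIMEOUT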
DataStage jobs fail with error message: Message: Error setting up internal communications (fifo RT_SCTEMP/jobName.fifo)
Symptoms
DataStage jobs fail with the following error message:
Message: Error setting up internal communications (fifo RT_SCTEMP/jobName.fifo)
DataStage is unable to create, delete, read, or write a temporary fifo file for a job to the RT_SCTEMP directory within the project that owns the job.
Causes
The error occurs for the following reasons:
v DataStage cannot process the files because they are locked.
v DataStage cannot process the files because of inadequate file permissions or other file system problems.
v A virus scan or backup program interferes with writing fifo files to temp or scratch directories, which is described in the following technote: https://www-304.ibm.com/support/docview.wss?uid=swg21445893&wv=1
In that situation, read the following technote for instructions on how to clear locks for a job: https://www-304.ibm.com/support/docview.wss?uid=swg21438482
If the issue is not caused by locks, then review the following checklist to resolve other common causes for this error:
v Check the file limits at job run time, especially if all jobs run under a common user ID such as DSADM. You can check the limits used at DataStage job run time, even if you cannot run jobs, by running the command through the DataStage Administrator client. Log in to the DataStage Administrator client, select the failing project, click the Command button, and then enter the command: sh ulimit -a
  If the open file limit is under 2048, consider increasing it. On busy systems it may need to be higher. In this situation, you can add a command to set the limit to the $DSHOME/dsenv script, such as: ulimit -n 10240
  After you make the change, you must stop and restart DataStage and then perform the test again to ensure that the new limit is in effect.
v Check available space on the volume that contains the RT_SCTEMP directory. If the project that contains failing jobs is named "MyProject", then the path to RT_SCTEMP is similar to the following path: /opt/IBM/InformationServer/Server/Projects/MyProject/RT_SCTEMP
v Check permissions for the RT_SCTEMP directory and the files inside it. Ensure that the user ID running jobs, which is listed on event messages in the job log, has read and write permission to the directory and the files within it, either via owner ID, group membership, or public permissions. A quick test to confirm if permissions are the problem is to set directory permissions temporarily to 777 so that all users can write to it.
v Confirm that DSADM or the user ID that runs the failing jobs can create a file in this directory by using the following steps:
  1. Log in to the server operating system as the user ID who runs the failing DataStage jobs.
  2. Change directory to the InformationServer/Server/Projects/projectname/RT_SCTEMP directory.
  3. Enter the following command: touch test.fifo
  If the above command fails, then the user ID is unable to create a file at that location, and that issue must be resolved before DataStage jobs can run correctly.
If this issue is not due to locks, then the DataStage error occurs due to an inability to correctly create, read, write, or delete the temporary fifo files. If the above tests do not isolate the cause of the file system I/O problem, then it may be necessary to contact Information Server support for assistance in performing a system trace (truss or strace) of the dsapi process that launches the failing jobs, to track down the actual OS operations that are failing.
Information Server or DataStage job fails with "Could not map table file"
Jobs start to fail when memory is fragmented or the amount of data that is used in lookups exceeds the limit.
Symptoms
Information Server or DataStage job fails with the following message:
Could not map table file
DataStage fails when trying to load lookup data into memory or create a lookup file.
Causes
There might be other applications running concurrently, leaving resources that are no longer available to DataStage. Available memory might be reconfigured from creating or moving LPARs. DataStage is limited in the amount of memory that can be allocated for a lookup. A single Lookup stage in the Designer can have multiple lookup inputs; these result in a corresponding number of Lookup operators in the generated osh script. When operator compatibility is optimal, each Lookup operator has one physical process for each partition that is defined by the configuration file. Each physical process can address only up to 2 GB of memory because it is a 32-bit application. The Windows version of the DataStage Parallel Engine is only available with 32-bit pointers. Each lookup requires contiguous memory allocation. Each process is limited to the ulimit setting of the DataStage environment, which can be limited by LDR_CNTRL on AIX.** Each lookup data set uses the entire partitioning method by default. With the entire partitioning method, one memory segment is used and shared across all partitions for a physical server. The method is defined by the fastname option in the configuration file.
** The LDR_CNTRL environment setting on AIX might limit the ulimit -d (data) setting even if you have the hard limit set higher.
For MPP environments, each server has an individual copy of a memory segment. If you use a method other than entire or auto partitioning, each partition uses its own copy of the data in memory, and each copy can use only up to 2 GB or ulimit -d (data)**. This method is the most constraining. All lookup data is processed to a file in scratch and then loaded into an mmap structure in memory. The mmap function is a C++ function. Allocation of this structure requires contiguous memory, and happens before any source data is processed for the lookup.
Parallel startup failed for job runs on multiple nodes across multiple servers
Parallel jobs can fail from configuration errors.
Symptoms
A parallel DataStage job with a configuration file that is set up to run multiple nodes on a single server fails with the following error:
Message: main_program: **** Parallel startup failed ****
Causes
The full text for this parallel startup failed error provides some additional information about possible causes of the problem. The problem is often caused by one of the following configuration errors:
v The Orchestrate installation directory is not properly mounted on all nodes.
v The rsh permissions are not set correctly with /etc/hosts.equiv or .rhosts.
v The job runs from a directory that is not mounted on all nodes.
The messages in the server log that precede the startup failed message contain more information about the cause of the failure. For the situation where a site is attempting to run multiple nodes on multiple server machines, the above statement is correct. More information about setting up ssh/rsh and parallel processing can be found in the following topics:
v Configuring remote and secure shells
v Configuring a parallel processing environment
In the case where all nodes are running on a single server machine, the "Parallel startup failed" message is usually an indication that the fastname defined in the configuration file does not match the name output by the "hostname" command on the server. In a typical node configuration file, the server name where each node runs is indicated by the fastname in /opt/IBM/InformationServer/Server/Configurations/default.apt:
{
  node "node1"
  {
    fastname "server1"
    pools ""
    resource disk "/opt/resource/node1/Datasets" {pools ""}
    resource scratchdisk "/opt/resource/node1/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "server1"
    pools ""
    resource disk "/opt/resource/node2/Datasets" {pools ""}
    resource scratchdisk "/opt/resource/node2/Scratch" {pools ""}
  }
}
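A quick check for the single-server case described above is to compare the fastname entries with the server host name (the default.apt path is the one shown above):

hostname
grep fastname /opt/IBM/InformationServer/Server/Configurations/default.apt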
Symptoms
The DataStage job log contains the following unrecoverable error:
Item #: 13
Event ID: 1960
Timestamp: 2011-09-01 06:30:44
Type: Fatal
User Name: dsadm
Message Id: IIS-DSEE-TFPM-00154
Message: main_program: APT_PMConnectionRecord::start: Reading connection message returned 28, expected 40, Error 0

Item #: 14
Event ID: 1961
Timestamp: 2011-09-01 06:30:44
Type: Fatal
User Name: dsadm
Message Id: IIS-DSEE-TFPM-00356
Message: main_program: **** Parallel startup failed **** This is usually due to a configuration error, such as not having the Orchestrate install directory properly mounted on all nodes, rsh permissions not correctly set (via /etc/hosts.equiv or .rhosts), or running from a directory that is not mounted on all nodes. Look for error messages in the preceding output.
Causes
The framework that is used by DataStage to start all the parallel processes uses TCP/IP connections during the startup phase, even on single-host configurations. The processes are listening for very specific responses on these ports to coordinate the startup. This error means that one of the processes received an unexpected response and terminated. These ports are used for low-level coordination between the specific DataStage processes, not for user requests, so there is very little error handling capability. When this error occurs, it is an indicator that some process other than DataStage has connected to one or more of the ports and put invalid data there. The DataStage process that receives this unauthorized connection has no alternative except to print the error and exit. The default port ranges used by DataStage for this communication are 10,000 - 11,000 and 11,000 and up (there is no upper bound, but reasonably it will not be more than a few thousand ports). These are not common port ranges for other software applications to use, so when this problem occurs it usually means that port scanning software, network monitoring software, or intrusion detection software might be the cause.
You can set these environment variables at the job or project level for testing. When you find suitable values and the problem does not recur, set and export these variables in the /opt/IBM/InformationServer/Server/DSEngine/dsenv file so that they take effect for all projects.
DataStage Parallel job failed to start because of a process fork failure in Solaris.
Symptoms
A parallel job terminates with the following message:
Fatal Error: Unable to start ORCHESTRATE process on node node1 (sun01): APT_PMPlayer::APT_PMPlayer: fork() failed, Not enough space
Causes
This error indicates that the fork() system call failed with an ENOMEM error returned, which means there is not enough swap space to support the virtual memory required by the call. The ENOMEM error for fork() can occur in other operating systems like AIX, Linux, or HP-UX, but it is seen more frequently in Solaris, because Solaris requires much more virtual memory when using fork(): it does not have a memory overcommit feature.
Linux, AIX, and HP-UX operating systems have a feature called memory overcommit, or lazy swap allocation. In memory overcommit mode, malloc() does not reserve swap space and always returns a non-NULL pointer, regardless of whether there is enough virtual memory on the system to support it. The swap space is made available only when the memory is referenced.
In contrast, under the Solaris OS, when the application calls malloc() and it internally calls sbrk(2) to get more memory from the system, the kernel goes through its free memory lists and finds the requested amount of virtual memory. If it finds the requested amount of virtual memory, the kernel returns a pointer to that memory and reserves the swap space for it such that no other process can use it until the owner releases it. If the requested amount of virtual memory is not found, malloc() fails with an ENOMEM error and returns a NULL pointer.
For a large-memory process in Solaris, the fork() system call can fail because of an inadequate amount of virtual memory, because the fork() call requires twice the amount of the parent memory. This can happen even when the fork() call is immediately followed by an exec() call that would release most of that extra memory.
DataStage Jobs fail to start or perform poorly when temporary directories are large
DataStage jobs write multiple files to temporary directories which are not automatically cleaned.
Symptoms
When the number of temporary files grows large, DataStage jobs have slower performance and can hang. At sites that have run DataStage for a year or more without cleaning up the temporary directories, these directories can contain 100,000 or more files.
Causes
Normally temporary files are cleaned up when a job ends. However, terminated jobs can leave behind files.
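The following sketch shows one way to identify old temporary files for cleanup. The directory paths and the seven-day age threshold are assumptions; adjust them for your installation, and run the commands only when no DataStage jobs are active.

  # List candidate files first; review the list before deleting anything.
  find /tmp /opt/IBM/InformationServer/Server/Scratch -type f -mtime +7 -print
  # After reviewing the list, remove the old files (only while no jobs are running):
  # find /tmp /opt/IBM/InformationServer/Server/Scratch -type f -mtime +7 -exec rm -f {} \;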
Symptoms
An InfoSphere DataStage job with a Join stage terminates with the following error: Unexpected terminated by UNIX Signal 11 (SIGSEGV)
Causes
If the size of the record is larger than the default setting of 20 MB, the sort that is inserted for the join fails.
Causes
The following items commonly cause job corruption:
v Full disk space in the /tmp (UVTEMP) directory or the DataStage project directory
v A 32-bit hash file becoming larger than 2 GB
v Power outages
v System crashes
v Rebooting the server while a job is running
v A virus checker or scanner running while a job is running
v A backup running while a job is running
v Failures on the system
Unable to view data or run a Parallel job with a Sequential File stage
In a DataStage parallel job with a Sequential File stage, you cannot view data or run the job.
Symptoms
The following error message is generated:
File archive: Trouble creating file
Parallel job with Sequential File stage plug-in. View data results in an error like:
IIS-DSEE-TFAR-00015 00:10:13 <main_program> File archive: Trouble creating file "/tmp/...."
Run time results in errors like:
Message Id: IIS-DSEE-TFAR-00015
Message: main_program: File archive: Trouble creating file "/tmp/...."
Message Id: IIS-DSEE-TFPX-00002
Message: main_program: Fatal Error: Null archive.
Causes
The program is searching for a relative path called "tmp" that does not exist. This occurs on Windows installations when the project is not on the same drive as the engine.
Procedure
1. Source your dsenv file in $DSHOME (. ./dsenv).
2. Go to your project directory (../InformationServer/Project/<project name>).
3. List all files and redirect the output to a file (ls > myfiles.txt). This file provides the list of files for uvbackup.
4. Run uvbackup and redirect its output to null with this command: "$DSHOME/bin/uvbackup -V -f -cmdfil myfiles.txt -s uvbackupout.txt -t /dev/null 2>&1 > testing123.txt"
5. grep "WARNING:" uvbackupout.txt
Results
The output file uvbackupout.txt helps to identify whether there are any corrupted files in the project.
Example
Here is an example of what you might see in the uvbackupout.txt file:
WARNING: Unable to open file RT_STATUS3 for reading. File not saved!
The uvbackup command verifies the integrity of the files and does not back up any files that are corrupted.
Causes
This error indicates that the job is running out of scratch, temporary, or swap space.
Symptoms
The resource directories and common space contain files with names that are similar to the following file name: lookuptable.20091210.513biba
Causes
When a job aborts, it leaves the temporary files in the resource directories for postmortem review. Temporary files are left in scratch, but lookup files are created in the resource directories. Lookup file sets are not removed. A lookup file set is similar to the following file set:
/opt/IBM/InformationServer/Server/Datasets/export.dsadm.abcdefg.P000000_F0000
A lookup file has a structure that is similar to the following file:
/opt/IBM/InformationServer/Server/Datasets/lookuptable.20091210.513biba
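A sketch for locating leftover lookup files in the resource directory is shown below. The path matches the example above, but the two-day age threshold is an assumption, and you must confirm that no running job still owns the files before removing them.

  # List lookup table files older than two days in the resource directory
  find /opt/IBM/InformationServer/Server/Datasets -name 'lookuptable.*' -mtime +2 -ls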
Causes
The lookup table is too large to fit in available memory.
Causes
Normally, when a DataStage job fails with the logged error message "Could not map table file", the message ends with "not enough space". In that situation, the issue is either insufficient disk space or a table that is too large to map into a single process, and it can be resolved by the steps described in the following related technote: Information Server or DataStage job fails with "Could not map table file". For the special case where the error message ends with "invalid argument", the cause is typically an I/O error that is unrelated to a full disk. The most common cause of the error is that one of the directories where DataStage is writing is mapped to a volume that was mounted with the CIO (concurrent I/O) option.
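On AIX, one way to check whether the directory in question is on a volume mounted with concurrent I/O is to inspect the mount options. This is a sketch and assumes the standard AIX mount command output.

  # Show all mounted file systems and highlight any mounted with the cio option
  mount | grep -i cio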
Symptoms
Jobs that process nulls in transformer stages show different behavior when migrated from 8.1 to 8.5, even when legacy null handling is set.
Causes
In InfoSphere Information Server version 8.1 and prior versions, the job design had to explicitly handle null column values in the Transformer stage. If the parallel engine encountered null column values outside of specific contexts, the entire row containing the null was dropped, or sent to a reject link if the Transformer stage had a reject link. Note: This topic refers to the SQL value NULL, not the character with value 0x00, and not an empty string. Explicit null handling made Transformer stage coding too complex and allowed inconsistent behavior. In InfoSphere Information Server version 8.5, the default behaviors were changed and explicit null handling is no longer required. It was recognized that some customers would want to retain the original null-handling behavior, so an environment variable, APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING, was introduced. This environment variable, when defined, preserves compatibility with the behavior of pre-version 8.5 InfoSphere Information Server.
Since version 8.5 shipped, differences in the default null handling behavior of version 8.5, and problems with the implementation of the InfoSphere Information Server 8.1 compatibility mode, have been discovered. There have been issues with null handling in InfoSphere Information Server 8.5 with backward compatibility enabled, and also with null handling in InfoSphere Information Server 8.1 and earlier versions. Most of these issues were due to a lack of clear explanation of how null values should be handled in the Transformer stage.
Check for NULL value
To test whether a value is NULL in a logical expression, use one of these two functions:
v IsNotNull()
v IsNull()
For example:
DSLink3.OrderCount + 1
--> If DSLink3.OrderCount is NULL, the record is dropped or rejected. This expression can be changed to:
If IsNotNull(DSLink3.OrderCount) Then DSLink3.OrderCount + 1 Else 1
--> If DSLink3.OrderCount is NULL, the target field is set to the integer 1.
Each nullable column in a given expression must be properly NULL checked, or the NULL value must be converted to a concrete value.
IF-ELSE operations on NULL values
Handling NULL values in IF-ELSE conditions can be complex. Consider the following examples to get familiar with using NULL checks in IF-ELSE statements.
Example 1: Simple IF ELSE statement
If (DSLink1.Col1 > 0) Then xxx Else yyy
In InfoSphere Information Server 8.5, code is generated to drop records if DSLink1.Col1 is NULL.
The condition that contains the nullable column should be pre-"AND"ed with an IsNotNull() check or pre-"OR"ed with an IsNull() check; use parentheses wherever needed so that the evaluation order is clearly specified.
Example 3: IF ELSE statement in which the nullable field is used multiple times
If ((DSLink1.Col1 = 3) or (DSLink1.Col1 = 5)) Then xxx Else yyy
Records are dropped if Col1 is NULL.
Each instance of the nullable field must be pre-"AND"ed or pre-"OR"ed with a NULL check.
Example 4: Using 2 nullable columns in a condition
If (DSLink1.Col1 = DSLink1.Col2) Then xxx Else yyy
Both the columns must be NULL checked or NULL conversion functions should be used on both the columns.
If (IsNotNull(DSLink1.Col1) and (IsNotNull(DSLink1.Col2) and (DSLink1.Col1 = DSLink1.Col2))) Then xxx Else yyy
InfoSphere Information Server 8.5 NULL handling: Explicit handling not required
In InfoSphere Information Server 8.5, a NULL value in an input column does NOT cause the row to be dropped, nor is it sent to the reject link. A NULL value in the input column is handled by the Transformer stage, following specific logic, so the job designer can skip explicit NULL handling. For further details, see the following technote: https://www.ibm.com/support/docview.wss?uid=swg21514921
The environment variable APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING can be used if the designer wants to have 8.1 behavior in 8.5. Old NULL handling can be enabled in the following three ways:
v Setting APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING at the project level
v Setting APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING at the job level
v Checking "Legacy null processing" in the DataStage Designer for individual Transformer stages in a given job
IBM has recently discovered a previously undocumented difference in behavior between InfoSphere Information Server 8.1 and InfoSphere Information Server 8.5 with old NULL handling enabled. InfoSphere Information Server 8.1 allowed the three "NullToxxxx()" functions to be used as NULL tests. The following is an example of such an IF-ELSE condition:
If ((NullToZero(DSLink1.Col1) = 0) or (DSLink1.Col1 > 0)) Then xxx Else yyy
The (NullToZero(DSLink1.Col1) = 0) test was treated as a NULL check, and records were not dropped or sent to the reject link, because of an inconsistency in the generated code. In InfoSphere Information Server 8.5 this code
inconsistency was eliminated, and only IsNull() and IsNotNull() can be used as NULL checks. InfoSphere Information Server 8.1 jobs that used NullToZero(), NullToValue(), or NullToEmpty() for NULL checking must be changed to use IsNull() or IsNotNull(). Note: !IsNull() can be used instead of IsNotNull(), and !IsNotNull() can be used instead of IsNull(), in NULL checking.
DataStage jobs that compiled on previous versions have Transformer compile errors on version 8.1
Symptoms
Your DataStage version 7 jobs that have a Transformer stage where a stage variable is set to null compile successfully. However, the same job on Information Server DataStage version 8.1 fails with the following compile error:
<transform> Error when checking composite operator: Setting null to this non-nullable field: StageVar0_myStageVariable
Causes
The warning occurs when a field comes in with Nullable=Yes, the output field is set to the value of that field, and the output field is Nullable=No. Setting a constraint in the database does not allow the incoming value to be NULL, but it does not eliminate the warning.
InfoSphere DataStage: Problems when running multiple instances of a job from a job sequence, or from a script that uses dsjob.
IBM InfoSphere DataStage: A number of related problems can occur when you run multiple instances of a job, either from a job sequence or from a script that sequences jobs by using the dsjob command.
Symptoms
v Multiple instances of a job are run from a sequence or a script, and the sequence reports status=99 for one or more of the job instances.
v Multiple instances of a job are run from a sequence or a script, and the job instances take a long time to start and to finish.
v More than 25 instances of a job are run from a sequence (or a script), and the sequence reports status=99 for one or more of the job instances.
v The system does not have enough resources because of a heavy work load, and the sequence reports an error code=-99 for a parallel job.
v On Intel RedHat and Suse platforms, jobs can hang despite having successfully run the underlying osh code.
v Some jobs run with missing parameters, or with parameters erroneously set to default values.
60). For example, if a value of 120 is required for DSWaitResetStartup, then ensure that DSWaitStartup is also set to a minimum of 120.)
v Environment variable: DS_NO_INSTANCE_PURGING
If the system is under extreme load, it might be necessary to use the DS_NO_INSTANCE_PURGING environment variable if Status=99 errors still occur when running many multi-instance jobs and auto-purge is enabled. This environment variable must be set to 1. This setting stops auto-purge from deleting the status records for the job instance, allowing the controlling job to read its status when system resources become available. (In other situations, you might want clean logs with no persistent instance entries, so the default behavior is to purge instance entries.)
v Client change, and environment variable DSJobStartedMax:
The number of recorded Instance Identifiers increased from 25 to 100, to prevent status records from being purged when more than 25 instances are being run simultaneously. If you are using N-Instance auto-purging and running more than 25 simultaneous instances, then the N-Instance auto-purge limit must be set to more than 25 in the Director or Administrator. If more than 100 instances are to be run simultaneously, then the environment variable DSJobStartedMax must be set to the required value, up to a maximum of 9999. The APAR number for this issue is JR30015.
Parallel job that runs on a remote node fails with broken pipe in IBM InfoSphere Information Server
Symptoms
You run a parallel job in a cluster environment with a remote node and receive the following message in the job log:
main_program: Fatal Error: Service table transmission failed for node2 (<node name>-svc:Broken pipe.
use the resource tracker to gather machine statistics from the machine where jobs are running. These statistics have no effect on the actual job that is running, and are unrelated to the information that is captured in the DataStage job monitor or job log. If you are not currently using the resource tracker functionality, it can easily be turned off to avoid this problem. To turn off the resource tracker, add the following APT_DISABLE_TRACKER_STARTUP environment variable at the project level, and set the default value to 1.
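On UNIX or Linux engine tiers, an alternative sketch is to export the variable from the dsenv file so that it applies to all projects; choosing a global setting rather than the project-level default described above is an assumption about what is appropriate for your site.

  # Disable the resource tracker for all jobs started from this engine
  APT_DISABLE_TRACKER_STARTUP=1; export APT_DISABLE_TRACKER_STARTUP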
Symptoms
A parallel job fails with one of the following error messages:
ds_ipcopen() - call to OpenFileMapping() failed - The system cannot find the file specified ds_ipcput() - timeout waiting for mutex
Causes
On systems with a heavy load, these timeouts can cause the job to fail.
Causes
The error usually indicates that a resource issue is the cause of the problem.
process is 30 seconds. The default for loading a score is 120 seconds. Set the following environment variable at the project level: APT_PM_NODE_TIMEOUT=300 You can increase the value of this environment variable to 600 if 300 does not resolve the problem. If adding the APT_PM_NODE_TIMEOUT environment variable does not correct the issue, monitor processor, disk space, memory, and swap when the job is running. Check with your network administrator to see if the nodes are on a SAN or NFS mount.
You receive an error similar to SQL0443N Routine "SYSIBM.SQLCOLUMNS" (specific name "COLUMNS") has returned an error SQLSTATE with diagnostic text "SYSIBM:CLI:-805". SQLSTATE=38553 when you run a job.
The connector cannot find any available nodes in the APT configuration file
Symptoms
When you run a job, you receive the following error: The connector could not find any available nodes in the APT configuration file. This usually occurs when a node pool constraint is specified for a connector stage but is not defined in the APT configuration file (CC_DB2Configuration::validateEnvironment, file CC_DB2NodeNegotiation.cpp, line 787).
Join Stage
Join stage exits with a heap allocation error
Symptoms
The join stage exits with a heap allocation error.
Causes
The Join stage processes the data by using the primary link, or driving data set, first. It gets a row from this link and then retrieves all rows with matching values from the secondary link (also called the reference link). These rows are temporarily stored in memory. If there is a large number of rows in the secondary link that match the current record in the primary link, then a large amount of memory is allocated to hold the result. If there is not enough heap memory, the job fails. Because of this, the amount of memory used by the Join stage depends, among other factors, on the cardinality of the reference side. The lower the cardinality of the reference side, the more memory the stage uses for each row. Note: Cardinality here means the uniqueness of the data values in a column. A column with all unique values is said to have high cardinality, while a column with repeated values is said to have low cardinality.
Changing this link order does not impact the outcome of the Join because inner joins and full outer joins are symmetric; in other words, the sides are interchangeable and the order does not impact the outcome of the join. If you are working with non-symmetric joins such as left outer and right outer joins, then changing the link order does have an impact on the output of the data, and therefore you should not change the link order unless you understand the consequences of this change. For a more detailed explanation of the differences between the types of joins this stage can perform, refer to the "Parallel Job Developer Guide".
Causes
If the size of the record is larger than the default setting of 20 MB, the sort that is inserted for the join fails.
Lookup Stage
Information Server DataStage Lookup stage fails on Linux
Symptoms
The Information Server DataStage Lookup stage fails on the Linux operating system.
Causes
The Lookup stage creates files in the resource disk area that use the C++ mmap function. When those files are on an NFS or shared mount, the mmap function might fail. This is a known issue on Linux and is due to the C++ libraries, not DataStage.
Parallel job with lookup aborts with a File too large error
Symptoms
A DataStage parallel job that contains a lookup fails with the following error:
Lookup_107,0: Error writing table file "/d01/Ascential/DataStage/Datasets/lookuptable.20100217.abcde": File too large
Causes
The lookup table is too large to fit in available memory.
Causes
The error message "Error finalizing / saving table" when writing to a data set generally occurs for one of several reasons:
v The user ID running the DataStage job does not have permission to write to the directory shown in the error. The user ID is identified in each event message in the job log for the failing job.
v The volume containing the output directory stated in the error message does not have enough free space to write the file.
v An "out of memory" error precedes the data set write error in the log in DataStage 8.1. (In DataStage 7.5.x, the only error is the space error; there is no additional memory-related error message even when that is the cause of the failure to write the lookup data set.)
v Temporary data sets might be written to the directory specified by the UVTEMP setting in the uvconfig file. Parallel jobs can also write output to the directory specified by the TMPDIR environment variable. Ensure that those directories have sufficient space for the file.
A quick way to check these locations is shown after this list.
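As a sketch, the following commands check the free space of the locations named above; the dataset path is an example from this guide, and the uvconfig location assumes a default installation with $DSHOME already set (source dsenv first).

  # Check free space on the dataset directory and /tmp (paths are examples)
  df -k /opt/IBM/InformationServer/Server/Datasets /tmp
  # Confirm where UVTEMP points so that its volume can be checked as well
  grep UVTEMP $DSHOME/uvconfig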
This memory error typically occurs when building large data sets because the data set must first be loaded into memory. The Lookup stage uses memory-mapped files: you must have enough system memory available to store the entire contents of the file AND enough disk space to shadow the file in memory. On a 32-bit system, there is a 2 GB limit on the size. Refer to the next troubleshooting tip for more information about dealing with the "could not map table file" error:
DataStage job fails with "Could not map table file"
Information Server or DataStage job fails with "Could not map table file"
Symptoms
Information Server or DataStage job fails with the following error: Could not map table file. DataStage fails when it tries to load lookup data into memory or to create the lookup file.
Causes
DataStage is limited in the amount of memory that can be allocated for a lookup.
As the amount of lookup data increases, you can add more nodes to the configuration file to further distribute data across more processes, and thus more memory segments. Many factors can lead to a job that used to run successfully starting to fail: the length of time since the server was rebooted (memory fragmentation), the amount of data used in the lookups growing to just over the limit, other applications running concurrently and using resources that used to be available to DataStage, or a reconfiguration of available memory by creating or moving LPARs. The LDR_CNTRL environment setting on AIX may limit the ulimit -d (data) setting even if you have the hard limit set higher.
Lookup file does not span across multiple scratch resource disks
Symptoms
The lookup file generated by a DataStage lookup does not span across multiple scratch resource disks that are defined per node when scratch fills up.
Causes
DataStage lookup files are memory mapped files, so there can only be one file per lookup process.
Symptoms
The following error message is generated:
File archive: Trouble creating file
Parallel job with Sequential File stage plug-in. View data results in an error like:
IIS-DSEE-TFAR-00015 00:10:13 <main_program> File archive: Trouble creating file "/tmp/...."
Run time results in errors like:
Message Id: IIS-DSEE-TFAR-00015
Message: main_program: File archive: Trouble creating file "/tmp/...."
Message Id: IIS-DSEE-TFPX-00002
Message: main_program: Fatal Error: Null archive.
Causes
The program is searching for a relative path called "tmp" that does not exist. This occurs on Windows installations when the project is not on the same drive as the engine.
Create a directory called "tmp" at the root of the drive where the DataStage project is located. For example, if the DataStage projects are on the D: drive, create the directory D:\tmp. If the directory already exists, check the remaining disk space on your drives to ensure that limited disk space is not the cause of the problem.
Causes
The issue is due to the incorrect settings of the Teradata COPLIB and COPERR environment variables.
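A minimal check, assuming the variables are set in dsenv, is to source the engine environment and verify that COPLIB and COPERR point at readable directories; the exact contents expected in those directories depend on your Teradata client installation.

  cd $DSHOME && . ./dsenv            # pick up the engine environment
  echo "COPLIB=$COPLIB"; ls -ld "$COPLIB"
  echo "COPERR=$COPERR"; ls -ld "$COPERR"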
Causes
This problem is caused when the ASB agent running on the engine machine is unable to find the connector library or the Teradata libraries. The agent is started by using the /local/IBM/InformationServer/ASBNode/bin/NodeAgents script. This script sources the /local/IBM/InformationServer/ASBNode/bin/NodeAgents_env_DS.sh file for any DataStage-specific environment, which in turn sources /local/IBM/InformationServer/Server/DSEngine/dsenv. The script eventually invokes the Agent.sh script to start the agent.
Causes
When DataStage runs write operations in parallel mode through immediate-mode or through the stream-operator, row-hash collisions might occur. These collisions can cause blocking and deadlocks.
Any updates for a particular row must come from the same partition to avoid a deadlock. Each partition uses a separate connection to the database. If multiple updates for a row do not come from the same connection, they can cause blocking or deadlocks.
Sort Stage
DataStage outputs a warning message about a partition key: Sort key "CO_ID" no longer exists in dataset schema
Symptoms
DataStage outputs a warning message about a partition key:
main_program: Sort key "CO_ID" no longer exists in dataset schema. It will be dropped from the inserted sortmerge collector.
main_program: There are no sort keys in the dataset schema. No parallel sortmerge operator will be inserted.
Transformer Stage
Checking composite operator errors
Symptoms
DataStage jobs that contain a Transformer stage normally call an external C++ compiler during DataStage job compilation and compile correctly. Under some conditions, the compile might fail with a large number of composite operator errors that are similar to the following errors:
v ##E IIS-DSEE-TBLD-00076 15:20:18(000) <main_program> Error when checking composite operator: Subprocess command failed with exit status 256.
v ##W IIS-DSEE-TFTM-00012 15:20:18(002) <transform> Error when checking composite operator: The number of reject datasets "0" is less than the number of input datasets "1".
v ##W IIS-DSEE-TBLD-00000 15:20:18(007) <main_program> Error when checking composite operator: Output from subprocess: Error 8: "/usr/include/machine/sys/_types.h", line 65 # Invalid type specifier combination in declaration: "short double".
Causes
If a large number of errors occur when "checking composite operator", that is often an indication that the compiler that is used with DataStage is incompatible or unsupported, or that incorrect compiler or linker options are being used.
DataStage jobs with transformer stage fail to compile on AIX due to many missing include files
Symptoms
When compiling a DataStage job that contains a Transformer stage on AIX, the compile fails with the following errors:
v ##W IIS-DSEE-TBLD-00000 17:52:00(010) <main_program> Error when checking composite operator: Output from subprocess: "/opt/IBM/InformationServer/Server/PXEngine/include/apt_components/transformop/transformbasehdrs.h", line 41.10: 1540-0836 (S) The #include file <map> is not found.
v "/opt/IBM/InformationServer/Server/PXEngine/include/apt_framework/operator.h", line 70.10: 1540-0836 (S) The #include file <vector> is not found.
v "/opt/IBM/InformationServer/Server/PXEngine/include/apt_util/custreport.h", line 36.10: 1540-0836 (S) The #include file <string> is not found.
v "/opt/IBM/InformationServer/Server/PXEngine/include/apt_util/iostream_s.h", line 23.10: 1540-0836 (S) The #include file <iostream.h> is not found.
v The #include file <vector> is not found.
v The #include file <string> is not found.
v The #include file <iostream.h> is not found.
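Because missing standard headers such as <vector> and <string> usually mean that the C++ compiler or its runtime is not fully installed, a first check on AIX is to confirm which XL C/C++ filesets are present and which compiler is found on the DataStage user's PATH. This is a sketch, and the fileset names returned vary by compiler version.

  # List the installed IBM XL C/C++ compiler filesets
  lslpp -l | grep -i xlc
  # Show which C++ compiler is found first on the PATH of the DataStage user
  which xlC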
DataStage error when importing a job with Transformer stages into a different system
Symptoms
The following error message is received when you import a job with Transformer stages into a different system:
RT_BP123.O/V0S11_TESTJOB1_Transformer.C is of unknown format
Error processing file RT_123.O/V0S11_TESTJOB1_Transformer.C. file not modified
Command CATALOG RT_BP123 V0S11_TESTJOB1_Transformer.C V0S11_TESTJOB1_Transformer.C LOCAL FORCE error:
Program V0S11_TESTJOB1_Transformer.C was not compiled with a supported version of the BASIC compiler. It must be recompiled.
Causes
This issue is related to having the environment variable APT_TRANSFORM_OPERATOR_DEBUG set. When this environment variable is set, the file with the "C" code is kept in the BP.O directory. When you then export the job with job executables, the .C file is included in the export. Because this binary section is included, you always get this error on import.
Causes
NLS settings allow non-US-ASCII characters to be used in Transformer stage names. However, the file system (for example, NFS) or the operating system might not recognize non-US-ASCII characters in module names.
Jobs with Transformer stage that use remote nodes abort with fatal error
Symptoms
A DataStage parallel job with a Transformer stage that uses remote nodes fails with the following error:
Item #: 19
Event ID: 126
Timestamp: 2010-06-23 13:37:03
Type: Fatal
User Name: t2etl01
Message: trn: Failed to distribute the shared library "/datastage/DataStage/Projects/ProjectName/RT_BP123.O/V10S0_xxxxxxxx_trn.o" to node "nodeName". [transform/transform.C:1827]
Causes
This error occurs when the projects directory either does not exist or is not accessible on the remote node.
Director logs are not showing warning messages for all records dropped by the Transformer stage
Symptoms
DataStage Director logs are not showing warning messages for all records dropped by the Transformer stage.
Causes
DataStage displays only 50 warning messages per node in the Director log when records are dropped by the Transformer stage. When more than 50 records are dropped by a Transformer stage, the Director log shows up to 50 warning messages per node and then goes into silent mode with the warning message "Warning, all other rejected records will be silent".
4. Set "Maximum log reject messages" to the required value or set to -1 for unlimited. If no value is specified, Maximum log reject messages defaults to 50 messages per node.
Causes
This error may be caused by a non-breaking space being entered into the column definition of the Transformer. This is typically caused by copying and pasting from a Microsoft Word or Excel document. The DataStage Designer prohibits inserting a normal space, but it does not check for non-breaking spaces.
Causes
DataStage 8.x is coded to maintain backwards compatibility for how nulls are handled in the Transformer stage. Jobs from previous releases that use a Transformer stage with smallint and bigint data types produced an error message. This behavior is controlled by the Transformer property called "Legacy null processing", which is automatically checked (set) when the job is imported into DataStage 8.x. DataStage jobs that are created with the Transformer stage in the current release, rather than imported from a previous release, do not have the "Legacy null processing" option set.
message if its producing stage is hash partitioned: "Sequential operator cannot preserve the partitioning of the parallel data set on input port 0."
v These issues can be worked around by setting the environment job parameters APT_NO_PART_INSERTION=True and APT_NO_SORT_INSERTION=True and then modifying the job to ensure that the partitioning and sorting requirements are met by explicit insertion.
Default Decimal Separator
Information Server releases affected: 8.0.1 Fix Pack 1 and higher, 8.1 Fix Pack 1 and higher, 8.5 GA
Prior to Information Server Version 8.0.1 Fix Pack 1, the default decimal separator specified via Job Properties > Defaults was not recognized by the APT_Decimal class in the parallel framework. This caused problems for the DB2 API stage, where decimals with a comma decimal point could not be processed correctly. This issue was fixed in release 8.0.1 Fix Pack 1, as APAR JR31597. The default decimal separator can be specified via a job parameter (for example, #SEPARATOR#). However, if the job parameter does not contain any value, '#' is taken as the decimal separator. This can cause the following error if the actual decimal separator is not '#':
Fatal Error: APT_Decimal::assignFromString: invalid format for the source string.
If you encounter this problem after upgrading, make sure the job parameter representing the default decimal separator contains the actual decimal separator character used by the input data. If changing the job parameter is not an option, you can set the environment variable APT_FORCE_DECIMAL_SEPARATOR. The value of APT_FORCE_DECIMAL_SEPARATOR overrides the value set for the "Decimal separator" property. If more than one character is set for this environment variable, the decimal separator defaults to a dot character (.).
Embedded Nulls in Unicode Strings
Information Server releases affected: 8.1 Fix Pack 1 and higher, 8.5 GA
Prior to Information Server 8.1 Fix Pack 1, nulls embedded in Unicode strings were not treated as data; rather, they were treated as string terminators. This caused data after the first null to be truncated. The issue was fixed in Fix Pack 1, as APAR JR33408, for Unicode strings that were converted to or from UTF-8 strings. As a result of this change, you may observe a change in job behavior where a bounded-length string is padded with trailing nulls. These extra nulls can change the comparison result of two string fields, generate duplicate records, make data conversion fail, and so on, depending on the job logic. To solve this problem, modify the job to set APT_STRING_PADCHAR=0x20 and call Trim() in the Transformer stage if needed.
Null Handling at column level
Information Server releases affected: 8.1 GA and higher, 8.5 GA
In parallel jobs, nullability is checked at runtime. It is possible for the user to set a column as nullable in the DataStage Designer, but at runtime the column is actually mapped as non-nullable (to match the actual database table, for example). Prior to 8.1 GA, the parallel framework issued a warning for this mismatch, but the job could potentially crash with a segmentation violation as a result. The warning was changed to a fatal error in 8.1 GA, as ECASE 124987, to prevent the job from aborting with
SIGSEGV. After this change, jobs that used to run with this warning present now abort with a fatal error. For example, this problem is often seen in the Lookup stage. To solve the problem, modify the job to make sure that the nullability of each input field of the Lookup stage matches the nullability of the same output field of the stage that is upstream of the Lookup stage.
Transformer Stage: Run-Time Column Propagation (RCP)
DataStage releases affected: 7.5 and higher
Information Server releases affected: 8.0 GA and higher, 8.1 GA and higher, 8.5 GA
When RCP is enabled at any DataStage 7.x release prior to 7.5, for an input field "A" that is mapped to an output field "B", both "A" and "B" are present in the output record. Starting with DataStage 7.5, it appears that "A" is simply being renamed to "B", so that only "B" appears in the output. In order to improve transform performance, a straight assignment like "B=A" is considered as renaming "A" to "B". Prior to the change, the straight assignment was considered as creating an additional field by copying "A" to "B". With this change in place, you now need to explicitly specify both "A" and "B" in the output schema in order to prevent "A" from being renamed to "B" and to create a new field "B"; ensure that both columns are defined on the output link so that both values are propagated.
Transformer Stage: Decimal Assignment
Information Server releases affected: 8.0 GA and higher, 8.1 GA and higher, 8.5 GA
The parallel framework used to issue a warning if the target decimal had smaller precision and scale than the source decimal. The warning was changed to an error in Information Server 8.0 GA, and as a result the input record is dropped if a reject link is not present. This behavior change was necessary to catch the error earlier and avoid data corruption. Modify the job to make sure that the target decimal is big enough to hold the decimal value. Alternatively, add a reject link to prevent records from being dropped. Important: This change in behavior does not apply to any Linux platforms (Redhat, Suse, or zLinux). The parallel framework does not enable exception handling on Linux platforms, so the behavior remains the same as it was prior to 8.0 GA.
Transformer Stage: Data Conversion
Information Server releases affected: 8.0 GA and higher, 8.1 GA and higher, 8.5 GA
Prior to Information Server 8.0 GA, an invalid data conversion in the Transformer resulted in the following behavior:
v A warning message is issued to the DataStage job log
v A default value was assigned to the destination field according to its data type
v The record was written to the output link
v If a reject link was present, nothing was sent to the reject link
The behavior changed in the 8.0 GA release when a reject link is present. Instead of the record being written to the output link with a default value, it is written to the reject link. This may lead to data loss if the job expects those records to be passed through to the output. To get the original behavior of passing the records through, modify the job to remove the reject link. An environment variable was added along with this change, to add the capability of aborting the job. To use this option, ensure that there is no reject link and then set the environment variable APT_TRANSFORM_ABORT_ON_CONVERSION_ERROR=True. The job then aborts on an invalid data conversion.
Surrogate Key Generator
Information Server releases affected: 8.0.1 Fix Pack 1 and higher, 8.1 Fix Pack 1 and higher, 8.5 GA
The surrogate key stage reserves keys in blocks. Prior to Information Server 8.1 Fix Pack 1, if only one record was generated (suppose it was value 5, because an initial value was set), the surrogate key generator would use values beginning with 6 and greater as available keys for incoming records. The surrogate key generator was changed in 8.1 Fix Pack 1, as APAR JR29667. With this change, DataStage now considers values 1 to 4, as well as any value 6 and greater, as available keys. This behavior change may cause the SCD stage to produce incorrect results in the database or generate the wrong surrogate keys for the new records of the dimension. If required, the job can be modified to revert to the old behavior (start generating keys from the highest key value last used) by setting the option 'Generate Key From Last Highest Value' to Yes. This approach, however, may result in gaps in used keys. It is recommended that you understand how the key file is initialized and decide whether it is necessary to modify the job based on business logic.
Sequential File Format on Windows
Information Server releases affected: (Windows platforms) 8.1 GA and higher, 8.5 GA
Prior to Information Server 8.1 GA, the default format for sequential files was UNIX format, which requires a newline character as the delimiter of a record. The default format for the Sequential File stage was changed to Windows format in the Information Server 8.1 GA release. Due to this change, data files previously created with UNIX format will not import properly. To solve this issue, set the environment variable APT_USE_CRLF=FALSE at the DataStage project level or within the system environment variables (requires a Windows reboot).
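On UNIX or Linux engine tiers, one way to apply several of the environment variables discussed in this section to every project is to export them from the dsenv file, as in the sketch below. The values shown are examples only, and setting them globally rather than per project or per job is an assumption about what your jobs require.

  # Examples only: review each value before applying it site-wide.
  APT_FORCE_DECIMAL_SEPARATOR=.; export APT_FORCE_DECIMAL_SEPARATOR
  APT_STRING_PADCHAR=0x20; export APT_STRING_PADCHAR
  APT_TRANSFORM_ABORT_ON_CONVERSION_ERROR=True; export APT_TRANSFORM_ABORT_ON_CONVERSION_ERROR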
Symptoms
DataStage jobs that use data sets become very slow over a period of time.
Causes
Data sets use a sync() call but, according to Solaris, they should be making an fsync() call.
Environment
This problem occurs only in cluster environments. The same jobs that are experiencing bottlenecks run normally in a non-cluster environment.
Heap allocation errors with DataStage Parallel Jobs on the AIX platform
Symptoms
A DataStage parallel job ends with the following error message:
APT_BadAlloc: Heap allocation failed.
Causes
AIX divides memory address space into segments. If the DataStage job needs to allocate more memory than exists in the number of available segments, the job ends with a heap allocation error or a failure to allocate memory.
1. Create a new parallel job.
2. Add an External Source stage (under File on the palette) connected to a Peek stage (under Development/Debug on the palette).
3. Access the advanced properties of the External Source stage, and make sure that it is running in parallel mode.
4. In the External Source stage, enter 'ulimit -a; ulimit -aH' (without the quotation marks) in the Source Program property, and define a column as VarChar with a length of 255. Use a configuration file that includes at least one node for each fast name (host) in your cluster or grid.
5. Compile the job, run it, and look in the Director log. The log contains the soft limits and the hard limits for each node in the configuration file. If the hard limit for data is too low, you need to contact your AIX administrator to increase that value. This value can be set in the file /etc/security/limits.
6. After you increase the hard limit settings, you can set the ulimit settings for the user in the ds.rc file located under $DSHOME/sample. You can add a line such as ulimit -d unlimited at the beginning of the file, after the umask settings. The ds.rc file is owned by root, and writable only by root, so your system administrator must change the file permissions. For security reasons, DO NOT change the owner or grant write permission to any non-root user. Important: do not set the number of file descriptors (ulimit -n) to unlimited. That setting causes a problem with DataStage. Ensure that the value for this limit is set sufficiently high; ulimit -n 100000 is a safe value in nearly all situations.
DataStage Version 7.5.x
The DataStage software is a 32-bit application for all 7.5.x releases, even when installed onto an AIX server with a 64-bit kernel. To obtain the maximum amount of process address space for your parallel job processes, set the LDR_CNTRL variable with the value MAXDATA=0x80000000@DSA as the default value at the project level (for all jobs in a project) or within specific jobs. Do not add LDR_CNTRL to your dsenv file. That setting might interfere with the memory model used by the Server Engine.
DataStage Version 8.0.x
In DataStage Version 8.0.x, the DataStage software is a 32-bit application for all 8.0.x releases, even when installed onto an AIX server with a 64-bit kernel. Starting with the Information Server 8.0 GA release, DataStage starts Java components to integrate with the services tier. For these Java components to function properly, the LDR_CNTRL=MAXDATA=0x60000000@USERREGS environment variable is added to the dsenv file. It is important that this variable is not removed or modified, to ensure the proper operation of the Java components. For parallel jobs that require more than 1.5 GB of memory per process, the LDR_CNTRL variable can be set to a larger value. This variable must be given a default value at the project level if you want it to take effect for all jobs in the project, or you can leave the project default value blank and assign a value to specific jobs only. As stated previously, DO NOT alter LDR_CNTRL within the dsenv file. To obtain the maximum amount of process address space for your job processes, set the LDR_CNTRL variable with the value MAXDATA=0x80000000@DSA in your job or as the project default.
DataStage Version 8.1.x
Starting with the 8.1 GA release, DataStage is a 64-bit application and requires a 64-bit AIX kernel. The osh executable is compiled with the MAXDATA=0x80000000 property, so the amount of memory address space available to the parallel job process is limited to 2 GB in the default configuration.
The improvement of being a 64-bit application allows for the allocation of more segments and a larger private memory address space. For situations where large
amounts of heap memory are required for each process, set LDR_CNTRL to the value MAXDATA=0x0000001000000000. This value allocates up to 64 GB of private data for each process. Set this large value at the job level rather than at the project level, to avoid large consumption of memory by jobs where you did not intentionally want this behavior.
DataStage Version 8.5
In DataStage Version 8.5, DataStage is a 64-bit application and requires a 64-bit AIX kernel, just like release 8.1. A significant improvement in this release is that the MAXDATA parameter has been removed from the executable. With this change, DataStage is able to access all of the available memory address segments in the default configuration. Any jobs or projects that had LDR_CNTRL specified with the MAXDATA parameter should be modified to remove this parameter after you upgrade to 8.5 so that you are able to access all of the segments. Important: the LDR_CNTRL=USERREGS environment variable MUST NOT be removed from the dsenv file; it is required for proper operation of the Java components loaded by DataStage processes. The USERREGS property does not impact the memory utilization of DataStage jobs.
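One way to confirm which MAXDATA value is compiled into the osh executable on AIX is to inspect its loader header with the dump command. This is a sketch: the PXEngine path assumes a default installation, and the exact dump options and field names can vary between AIX levels, so verify them on your system.

  # AIX only: show the maxdata value recorded in the osh executable header
  dump -ov /opt/IBM/InformationServer/Server/PXEngine/bin/osh | grep -i maxdata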
Note: The OS parameter nofiles must be set higher than MFILES. Ideally, nofiles should be at least 512. This allows the DataStage process to open up to 512 - (MFILES + 8) files. On most UNIX systems, the proc file system can be used to monitor the file handles opened by a given process; for example:
ps -ef | grep dsrpcd
root 23978 1 0 Jul08 ? 00:00:00 /opt/ds753/Ascential/DataStage/DSEngine/bin/accdsrpcd
ls -l /proc/23978/fd
lrwx------ 1 root dstage 64 Sep 25 08:24 0 -> /dev/pts/1 (deleted)
l-wx------ 1 root dstage 64 Sep 25 08:24 1 -> /dev/null
l-wx------ 1 root dstage 64 Sep 25 08:24 2 -> /dev/null
lrwx------ 1 root dstage 64 Sep 25 08:24 3 -> socket:[12928306]
The dsrpcd process (23978) has four files open.
T30FILE
This parameter determines the maximum number of dynamic hash files that can be opened system-wide on the DataStage system. If this value is too low, expect to find an error message similar to 'T30FILE table full'. The following engine command, executed from $DSHOME, shows the number of dynamic files in use:
echo "`bin/smat -d|wc -l` - 3"|bc
Use this command to assist with tuning the T30FILE parameter. See the following technote: https://www-304.ibm.com/support/docview.wss?uid=swg21390117
Every running DataStage job requires at least 3 slots in this table (RT_CONFIG, RT_LOG, RT_STATUS). Note, however, that multi-instance jobs share slots for these files, because although each job run instance creates a separate file handle, this just increments a usage counter in the table if the file is already open to another instance. Note that on AIX the T30FILE value should not be set higher than the system setting ulimit -n.
GLTABSZ
This parameter defines the size of a row in the group lock table. Tune this value if the number of group locks in a given slot is getting close to the value defined. Use the LIST.READU EVERY command from the server engine shell to assist with monitoring this value. LIST.READU lists the active file and record locks; the EVERY keyword lists the active group locks in addition. For example, with a Designer client and a Director client both logged in to a project named dstage0:
Active Group Locks:                                       Record Group Group Group
Device.... Inode..... Netnode Userno Lmode G-Address.      Locks ...RD ...SH ...EX
 838222719 2039334646       0   5620 62 IN        800          1     0     0     0

Active Record Locks:
Device.... Inode..... Netnode Userno Lmode  PID Item-ID.....................
 838222719 2039334646       0  64332 62 RL 1204 dstage0&!DS.ADMIN!&
 838222719 2039334646       0  62412 62 RL 3124 dstage0&!DS.ADMIN!&
Device
A number that identifies the logical partition of the disk where the file system is located
Inode
A number that identifies the file that is being accessed
Netnode
A number that identifies the host from which the lock originated. 0 indicates a lock on the local machine, which will usually be the case for DataStage. If other than 0, then on UNIX it is the last part of the TCP/IP host number specified in the /etc/hosts file; on Windows it is either the last part of the TCP/IP host number or the LAN Manager node name, depending on the network transport used by the connection.
Userno
The phantom process that set the lock
Pid
A number that identifies the controlling process
Item-ID
The record ID of the locked record
Lmode
The number assigned to the lock, and a code that describes its use
G-Address
Logical disk address of the group, or its offset in bytes from the start of the file, in hex
Record Locks
The number of locked records in the group
Group RD
Number of readers in the group
Group SH
Number of shared group locks
Group EX
Number of exclusive group locks
When the report describes file locks, it contains the following Lmode codes:
FS, IX, CR
Shared file locks
FX, XU, XR
Exclusive file locks
When the report describes group locks, it contains the following Lmode codes:
EX
Exclusive lock
SH
Shared lock
RD
Read lock
WR
Write lock
IN
System information lock
When the report describes record locks, it contains the following Lmode codes:
RL
Shared record lock
RU
Update record lock
RLTABSZ
This parameter defines the size of a row in the record lock table. From a DataStage job point of view, this value affects the number of concurrent DataStage jobs that can be executed, and the number of DataStage Clients that can connect. Use the LIST.READU command from the DSEngine shell to monitor the number of record locks in a given slot. With one Director client logged in to a project named dstage0, and two instances of a job in that project that are running, the active record locks are similar to the following example:
Active Record Locks:
Device.... Inode..... Netnode Userno Lmode  Pid Item-ID.............
 838222719 2039334646       0  64332 62 RL 1204 dstage0&!DS.ADMIN!&
 838222719 2039334646       0  62128 62 RL 3408 dstage0&!DS.ADMIN!&
 838222719 2039334646       0  65252 62 RL  284 dstage0&!DS.ADMIN!&
 304877956  328255620       0  62128 62 RL 3408 RT_CONFIG456
 304877956  328255620       0  65252 62 RL  284 RT_CONFIG456
In the above report, Item-ID=RT_CONFIG456 identifies that the running job is an instance of job number 456, whose compiled job file is locked while the instance is running so that, for example, it cannot be recompiled in that time. A job's number within its project can be seen in the Director job status view, in the detail dialog for a particular job. The unnamed column between Userno and Lmode relates to a row number within the record lock table. Each row can hold RLTABSZ locks. In the above example, 3 slots out of 75 (the default value for RLTABSZ) have been used for row 62. When the number of entries for a given row gets close to the RLTABSZ value, it is time to consider re-tuning the system. Jobs can fail to start, or generate -14 errors, if RLTABSZ is being reached. DataStage clients may see an error message similar to 'DataStage Project locked by Administrator' when attempting to connect. Note that the error message can be misleading: it means in this case that a lock cannot be acquired because the lock table is full, not that another user already has the lock.
MAXRLOCK
This parameter must always be set to the value of RLTABSZ - 1. Each DSD.RUN process takes a record lock on a key name <project>&!DS.ADMIN!& of the UV.ACCOUNT file in $DSHOME (as seen in the examples above). Each DataStage client connection (for example, Designer, Director, Administrator, or the dsjob command) takes this record lock as well. This is the mechanism by which DataStage determines whether operations such as project deletion are safe; such operations cannot proceed while a project lock is held by any process. MAXRLOCK needs to be set to accommodate the maximum number of jobs and sequences, plus client connections, that will be used at any given time, and RLTABSZ needs to be set to MAXRLOCK + 1. Keep in mind that changing RLTABSZ greatly increases the amount of memory needed by the disk shared memory segment.
Customer Support has reported that settings of 130/130/129 (for RLTABSZ/GLTABSZ/MAXRLOCK, respectively) work successfully on most customer installations. There have been reports of high-end customers using settings of 300/300/299, so the appropriate values are environment specific. If sequencers or multi-instance jobs are used, start with the recommended settings of 130/130/129, and increase to 300/300/299 if necessary.

Prior to DataStage v8.5, the following settings were pre-defined:
v MFILES = 150
v T30FILE = 200
v GLTABSZ = 75
v RLTABSZ = 75
v MAXRLOCK = 74 (75-1)

DataStage v8.5 has the following settings pre-defined:
v MFILES = 150
v T30FILE = 512
v GLTABSZ = 75
v RLTABSZ = 150
v MAXRLOCK = 149 (150-1)

These are the lowest suggested values to accommodate all system configurations, so tuning of these values is often necessary.

DMEMOFF, PMEMOFF, CMEMOFF, NMEMOFF
These are the shared memory address offset values for each of the four DataStage shared memory segments (Disk, Printer, Catalog, NLS). Depending on the platform, PMEMOFF, CMEMOFF, and NMEMOFF might need to be increased to allow a large disk shared memory segment to be used. Where these values are set to 0x0 (on AIX, for example), the operating system manages the offsets. Otherwise, PMEMOFF minus DMEMOFF determines the largest disk shared memory segment size. Additionally, on Solaris for example, these values might need to be increased to allow a greater heap size for the running DataStage job. Note that when running the shmtest utility, take care when interpreting its output. The utility tests the availability of memory that it can allocate at the time it runs, which is affected by the current uvconfig settings, by how much shared memory is already in use, and by other activity on the machine at the time.
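The outline below is a minimal sketch of how these tunable parameters are typically changed. It assumes that $DSHOME points at the engine directory, that no jobs or client sessions are active, and that you keep a backup of the current configuration; verify the exact procedure for your release before applying it.

# Minimal sketch, assuming $DSHOME is the engine directory and the engine is idle.
cd $DSHOME
. ./dsenv                        # source the engine environment
ipcs -m                          # optional: review the existing shared memory segments first
cp uvconfig uvconfig.bak         # keep a backup of the current settings
# Edit uvconfig and raise RLTABSZ, GLTABSZ, and MAXRLOCK (for example 130/130/129),
# keeping MAXRLOCK = RLTABSZ - 1.
./bin/uv -admin -stop            # stop the engine
./bin/uvregen                    # regenerate the engine configuration from uvconfig
./bin/uv -admin -start           # restart the engine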
Enabling tracing for DataStage parallel jobs

Procedure
1. Enable the following reporting environment variables, at the project level in the Administrator client or at the job level, and set them to true:
v APT_DUMP_SCORE
v APT_PM_SHOWRSH
v APT_PM_SHOW_PIDS
v OSH_EXPLAIN
v APT_DISABLE_COMBINATION
2. Add a new user-defined environment variable called DS_PXDEBUG in DS Administrator. The value must be undefined for the project: leave the value blank or set it to 0 at the project level. Add this environment variable at the job level and set the value to 1. The DS_PXDEBUG variable causes the job to report debugging information.
Results
Debug information is collected under a new project-level directory called Debugging. Subdirectories are created on a per-job basis and are named after the job. For multi-instance jobs that run with a non-empty invocation ID, the directory name includes both the job name and the invocation ID.
What to do next
Execute the job. Send an export of the job, together with the detailed job log and the <project path>/Debugging/<job name> directory, to IBM Support.
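As a sketch of these steps from the command line, the commands below assume the dsjob client in the engine environment and use a hypothetical project name (myproject), job name (myjob), and project path; substitute your own values.

# Minimal sketch; project, job, and paths are hypothetical examples.
. $DSHOME/dsenv                                              # engine environment ($DSHOME assumed to be set)
$DSHOME/bin/dsjob -run -jobstatus myproject myjob            # run the job and wait for it to finish
$DSHOME/bin/dsjob -logsum myproject myjob > myjob_log.txt    # capture a summary of the job log

# Package the per-job debug output for IBM Support, together with the job export and log.
cd /opt/IBM/InformationServer/Server/Projects/myproject      # assumed project path
tar -cvf myjob_debug.tar Debugging/myjob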
Contacting IBM
You can contact IBM for customer support, software services, product information, and general information. You also can provide feedback to IBM about products and documentation. The following table lists resources for customer support, software services, training, and product and solutions information.
Table 5. IBM resources

IBM Support Portal
   You can customize support information by choosing the products and the topics that interest you at www.ibm.com/support/entry/portal/Software/Information_Management/InfoSphere_Information_Server

Software services
   You can find information about software, IT, and business consulting services on the solutions site at www.ibm.com/businesssolutions/

My IBM
   You can manage links to IBM Web sites and information that meet your specific technical support needs by creating an account on the My IBM site at www.ibm.com/account/

Training and certification
   You can learn about technical training and education services designed for individuals, companies, and public organizations to acquire, maintain, and optimize their IT skills at http://www.ibm.com/software/swtraining/

IBM representatives
   You can contact an IBM representative to learn about solutions at www.ibm.com/connect/ibm/us/en/
Providing feedback
The following table describes how to provide feedback to IBM about products and product documentation.
Table 6. Providing feedback to IBM

Product feedback
   You can provide general product feedback through the Consumability Survey at www.ibm.com/software/data/info/consumability-survey
Documentation feedback
   To comment on the information center, click the Feedback link on the top right side of any topic in the information center. You can also send comments about PDF file books, the information center, or any other documentation in the following ways:
   v Online reader comment form: www.ibm.com/software/data/rcf/
   v E-mail: comments@us.ibm.com
Product accessibility
You can get information about the accessibility status of IBM products.

The IBM InfoSphere Information Server product modules and user interfaces are not fully accessible. The installation program installs the following product modules and components:
v IBM InfoSphere Business Glossary
v IBM InfoSphere Business Glossary Anywhere
v IBM InfoSphere DataStage
v IBM InfoSphere FastTrack
v IBM InfoSphere Information Analyzer
v IBM InfoSphere Information Services Director
v IBM InfoSphere Metadata Workbench
v IBM InfoSphere QualityStage
For information about the accessibility status of IBM products, see the IBM product accessibility information at http://www.ibm.com/able/product_accessibility/index.html.
Accessible documentation
Accessible documentation for InfoSphere Information Server products is provided in an information center. The information center presents the documentation in XHTML 1.0 format, which is viewable in most Web browsers. XHTML allows you to set display preferences in your browser. It also allows you to use screen readers and other assistive technologies to access the documentation.
Notices
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing IBM Corporation North Castle Drive Armonk, NY 10504-1785 U.S.A. For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to: Intellectual Property Licensing Legal and Intellectual Property Law IBM Japan Ltd. 1623-14, Shimotsuruma, Yamato-shi Kanagawa 242-8502 Japan The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web
sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact: IBM Corporation J46A/G4 555 Bailey Avenue San Jose, CA 95141-1003 U.S.A. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee. The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us. Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information is for planning purposes only. The information herein is subject to change before the products described become available. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to
IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs. Each copy or any portion of these sample programs or any derivative work, must include a copyright notice as follows: (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved. If you are viewing this information softcopy, the photographs and color illustrations may not appear.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at www.ibm.com/legal/copytrade.shtml. The following terms are trademarks or registered trademarks of other companies: Adobe is a registered trademark of Adobe Systems Incorporated in the United States, and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office UNIX is a registered trademark of The Open Group in the United States and other countries. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. The United States Postal Service owns the following trademarks: CASS, CASS Certified, DPV, LACSLink, ZIP, ZIP + 4, ZIP Code, Post Office, Postal Service, USPS and United States Postal Service. IBM Corporation is a non-exclusive DPV and LACSLink licensee of the United States Postal Service. Other company, product or service names may be trademarks or service marks of others.
Index

A
at command 9
authentication errors 5

C
cron command 9
customer support
   contacting 85

D
Designer client
   handling exceptions 14
   viewing error reports 15
   viewing log files 15

F
Failed to authenticate 1, 2, 5
failure to authenticate user 5
failure to connect 1, 2

J
job termination problems 10

L
legal notices 91

O
ODBC connections
   checking symbolic links 12
   shared library environment 11
   UNIX and Linux systems 10
ODBC drivers
   UNIX and Linux systems 10

P
product accessibility
   accessibility 89
product documentation
   accessing 87

R
running out of file units 12
running out of memory on AIX computers 13

S
schedule log
   dsr_sched.log 6
   viewing 6
scheduled jobs 5
   AIX servers 9
   checking user rights 7
   localizing days of week 7
   testing user name and password
   UNIX and Linux servers 9
scheduling
   Windows servers 6
software services
   contacting 85
support
   customer 85

T
trademarks
   list of 91

U
UNIX and Linux
   configuration problems 12, 13

V
viewing scheduled jobs 9

W
WebSphere application server
   fails to start 3
   on AIX and Linux 3