NetBackup Tape Troubleshooting
Problem
Troubleshooting Drive/Library Issues in NetBackup
This document describes various tape drive and library issues that may be
encountered whilst using NetBackup, and explains how to resolve them.
Solution
It is important to understand that NetBackup does not write data directly to a tape
drive. For example: when using Solaris, NetBackup relies on the operating system to
write the data to the tape using the st tape driver. The only slight involvement with
NetBackup is that it specifies the block size to use - but this is still passed to the
operating system. Other operating systems work in a similar manner.
The SCSI pass-through driver (the sg driver on Solaris) allows SCSI commands to be
passed directly to the drive. For example, the 'test-unit-ready' SCSI command is
used when mounting a tape. On occasion, it is necessary to
recreate/rebuild the pass-through driver. The most common symptom involving
the pass-through driver is that the scan command does not show all expected
devices. Other issues involving the pass-through driver are very rare.
The majority of drive/tape issues have a cause outside of NetBackup. When
troubleshooting these issues it is advisable to start the troubleshooting process at
the hardware/firmware level.
Always bear in mind that NetBackup reporting an error does not
mean that NetBackup is the cause.
Common drive issues include:
Scan command
TAPE_ALERT
ASC/ASCQ
Missing Path
Positioning errors
Read/ Write errors
I/O Errors
External event has caused rewind
Tapes not reaching capacity (for example, 300GB of data is written to a tape
with a native capacity of 400GB)
Tapes being incorrectly marked as 'read only'
Library Inventory Issues
Robot load issue - "Error bptm error requesting media TpErrno = Robot operation
failed"
Missing drives, or drives disappearing and reappearing
Tapes failing to mount in NetBackup, but visible and usable by operating system
commands
Issues moving tapes to/ from slots or drives
Issues with Cartridge memory
Cleaning tape
In the first instance, it is always worth power cycling the library or drives reporting an
issue, as well as rebooting the associated servers. Many of the errors referenced in
this tech note can sometimes be cleared this way. If this does not clear
the issue, it has at least been eliminated as the cause.
Scan Command
Scenario: The scan command shows no devices at all, or some or all of the devices
disappear and reappear when the command is run repeatedly.
Firstly, it must be confirmed that the operating system can see and communicate
correctly with the tape drives.
The devices appearing in (for example) 'Device Manager' (Windows) or cfgadm
(Solaris) is NOT necessarily sufficient confirmation that the devices are correctly
configured to the operating system.
It has been seen that although devices appear to be visible to the operating system,
SAN issues prevented full/correct communication, and as a result, the scan
command failed.
Two things need to be checked before further troubleshooting is carried out:
1. Ensure no backups are running on the drives (only applicable if the drives are
shared). A SCSI reservation of a drive due to a backup may prevent the drive from
responding to, and thus appearing in the output of the scan command.
2. Rebuild the passthrough driver (Unix only). If the drive/operating system
configuration has not changed, then this is very unlikely to be the issue. However, it
can be eliminated from being the cause by recreating the passthrough links and files.
See the device configuration guide for information on how to do this.
Aside from the exceptions above, issues with the scan command are not caused by
NetBackup. When it is understood how the scan command works, it is clear that the
root of the issue is external to NetBackup.
Although the scan command is supplied by Symantec, it does not issue any
NetBackup commands, or interact with NetBackup in any way. When run, it issues
operating system level SCSI commands to the devices configured in the operating
system, and the output of the command is sent from the devices themselves. There
are no settings, tuning or troubleshooting that can be performed on the scan
command.
Windows servers do not require a passthrough driver. Provided that there are no
backups running on other servers that may share the drives, the problem will be
caused by an issue with the SAN, firmware, hardware or drivers.
Consideration should be given to SAN infrastructure (e.g. switches), HBAs or the
physical drive/library.
Unix servers require a passthrough driver. For example, on Solaris this is called the
sg driver. This is required as the SCSI commands issued to query the device cannot
be passed to the devices via the regular operating system driver.
If the scan command shows devices disappearing and reappearing, then the
passthrough driver is not the cause. If the device(s) permanently disappear, it may
be worth reconfiguring the passthrough driver. If the issue is not resolved, then the
issue will be as per Windows servers, that is, SAN infrastructure (e.g. switches),
HBAs or the physical drive/library. Consideration should also be given to HBA
configuration files, as incorrect settings in these have been seen to prevent output
from the scan command being returned.
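The distinction drawn above between devices that come and go and devices that stay missing can be sketched in Python. This is only an illustration: it assumes the device names have already been extracted from several runs of the scan command into sets, and the function name, the labels and the device names in the usage example are hypothetical, not part of NetBackup.

```python
def classify_scan_runs(expected, runs):
    """Classify each expected device by how often it appeared across scan runs.

    expected: set of device names that should be visible (hypothetical input).
    runs:     list of sets, one per run of the scan command.
    """
    if not runs:
        raise ValueError("at least one scan run is required")

    result = {}
    for device in sorted(expected):
        seen = sum(1 for run in runs if device in run)
        if seen == len(runs):
            result[device] = "stable"
        elif seen == 0:
            # Permanently missing: on Unix it may be worth rebuilding
            # the passthrough driver before looking at SAN/HBA/hardware.
            result[device] = "missing"
        else:
            # Comes and goes: per the text, the passthrough driver is
            # not the cause; look at SAN, HBA or the drive/library.
            result[device] = "intermittent"
    return result


if __name__ == "__main__":
    expected = {"/dev/rmt/0cbn", "/dev/rmt/1cbn", "/dev/rmt/2cbn"}
    runs = [
        {"/dev/rmt/0cbn", "/dev/rmt/1cbn"},
        {"/dev/rmt/0cbn"},
        {"/dev/rmt/0cbn", "/dev/rmt/1cbn"},
    ]
    for device, state in classify_scan_runs(expected, runs).items():
        print(device, state)
```

Running the scan command several times before drawing a conclusion matters here, because a single run cannot tell an intermittent device from a permanently missing one.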
TAPE_ALERT
A tape alert message is a critical, warning, or informational alert that occurs due to a
tape drive or robotic library hardware event. These tape alert messages are stored
on the tape drive or robotic library. Applications like NetBackup query the tape device
or robotic library for these messages and display them to the
user. Tape alert messages are reported in the NetBackup bptm log. The tape alert
technology detects and logs hardware and media errors.
It is important to remember that while NetBackup displays these "tape alerts", the
alerts occur due to a tape drive or robotic library hardware event. Check the Event
Viewer/system log for any hardware related errors. Contact the Original Equipment
Manufacturer (OEM) for support.
Because a TapeAlert is sent from the drive itself, it cannot be caused by
NetBackup.
For example:
Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] TapeAlert
Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive TLD0_LTO4_DRIVE1
(index 4), Media Id R0TP01
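When many such messages need to be reviewed, the fields can be pulled out of the log line with a short script. The following Python sketch is modelled only on the single example line above; the pattern and the function name are assumptions, and other NetBackup versions or platforms may format the message differently.

```python
import re

# Pattern modelled on the one example line in this tech note (assumption:
# other NetBackup versions may word or order the fields differently).
TAPE_ALERT_RE = re.compile(
    r"TapeAlert\s+Code:\s*(?P<code>0x[0-9A-Fa-f]+),\s*"
    r"Type:\s*(?P<type>\w+),\s*"
    r"Flag:\s*(?P<flag>.+?),\s*"
    r"from drive\s+(?P<drive>\S+)\s*"
    r"\(index\s+(?P<index>\d+)\),\s*"
    r"Media Id\s+(?P<media>\S+)"
)


def parse_tape_alert(line):
    """Return the TapeAlert fields found in one log line, or None."""
    match = TAPE_ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"], 16)   # 0x03 -> 3
    fields["index"] = int(fields["index"])
    return fields


if __name__ == "__main__":
    # The example above, joined back into a single line.
    sample = (
        "Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] TapeAlert "
        "Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive TLD0_LTO4_DRIVE1 "
        "(index 4), Media Id R0TP01"
    )
    print(parse_tape_alert(sample))
```

Grouping the parsed results by drive and by media ID is a quick way to see whether the alerts follow a particular drive or a particular tape.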
ASC/ASCQ
SCSI sense keys describe the state of a device. They are returned, together with the
ASC/ASCQ codes, when a command completes with a 'check condition' status.
In this example, robtest was failing to load a tape into a drive.
Initiating MOVE_MEDIUM from address 1000 to 500
move_medium failed, CHECK CONDITION
sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED
The analysis can be broken down as follows:
Sense Key 0x5 - Illegal Request
ASC/ASCQ 0x30/0x00 - Incompatible Medium Installed
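The same breakdown can be done programmatically. The Python sketch below parses a sense line in the format shown in the robtest output above and looks up the sense key name. The sense key names are the standard SCSI ones; the regular expression and the function name are assumptions based on the example lines in this tech note.

```python
import re

# Standard SCSI sense key names (0xC is not assigned a name here).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# Format assumed from the robtest output quoted in this tech note.
SENSE_RE = re.compile(
    r"sense key\s*=\s*(?P<key>0x[0-9A-Fa-f]+),\s*"
    r"asc\s*=\s*(?P<asc>0x[0-9A-Fa-f]+),\s*"
    r"ascq\s*=\s*(?P<ascq>0x[0-9A-Fa-f]+)"
    r"(?:,\s*(?P<text>.*))?"
)


def decode_sense(line):
    """Split a sense line into numeric fields plus the sense key name."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key = int(match.group("key"), 16)
    return {
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "UNKNOWN"),
        "asc": int(match.group("asc"), 16),
        "ascq": int(match.group("ascq"), 16),
        # Text reported by the device/robtest, kept verbatim.
        "description": (match.group("text") or "").strip(),
    }


if __name__ == "__main__":
    print(decode_sense(
        "sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED"
    ))
```

The ASC/ASCQ pair itself is deliberately not translated by the script: the full list of assignments is maintained by T10, and the text already printed by the device is kept as the description.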
In a similar manner to Tape Alerts, SCSI Sense Keys are produced by the device,
not by NetBackup.
As ASC/ASCQ alerts are sent from the hardware, it is impossible for them to be
caused by NetBackup.
It has been seen that a power cycle of the drive (not soft reset) can sometimes clear
ASC/ASCQ errors.
Further information on these values can be found at http://www.t10.org
To further investigate ASC/ASCQ issues, Symantec recommends contacting your
hardware vendor.
Note
If hardware encryption is in use via NetBackup KMS, an issue with the service may
cause the drives to send out ASC/ASCQ errors relating to encryption. In this
instance, although the drive is sending the message, the cause may be the KMS
service, and so this should be given consideration.
Missing Path
Missing path means that the Operating System has lost connectivity to the drives. At
this point, you will find that the devices are also missing from the scan output, simply
because the scan command only communicates with devices found at the Operating
System level.
For this issue, NetBackup is not the cause. However, when the issue is resolved, the
paths to the devices may have changed, making the NetBackup
configuration incorrect. If this is the case, the devices will need to be deleted and
reconfigured within NetBackup. If the devices come back with the same operating
system paths, then no further action should be required.
Positioning Errors
Positioning errors occur when the operating system is unable to position,
fast-forward or rewind the tape.
The error message seen may differ slightly, depending on when the error occurs.
Example 1
<2> write_data: block position check: actual 62504, expected 31254
Example 2
1/11/2010 7:50:13 AM - Error bptm(pid=3364) ioctl (MTREW) failed on media
id W00229, drive index 0, The I/O bus was reset. (1111) (bptm.c.8039)
NetBackup requests the operating system to position the tape, at various points of
the backup. Failure to correctly position, although detected by NetBackup, is most
commonly caused by:
1. Hardware error
2. Tape error
3. Driver issue
4. Firmware issue
a) One known issue can be seen in the bptm log, affecting NBU 6.5.6 to 7.0.1.
Error bptm (pid=2164) ioctl (MTWEOF) failed on media id V01497, drive index
0, The physical end of the tape has been reached.
Note
a) McAfee anti-virus software is known to be a possible cause of Status 84 errors
on Windows Media Servers.
b) Cyclic redundancy check errors indicate faulty hardware.
c) MSEO is not compatible with Asynchronous Tapemarks, which were introduced in
NetBackup 7.1. Symptoms include write and/or read errors on tapes encrypted with
MSEO. Creating the empty file '
I/O Error
I/O errors are caused at a hardware level, and are only detected by NetBackup.
For example:
11:20:18.246 [8504.5292] <4> write_data: WriteFile failed with: The request could
not be performed because of an I/O device error. (1117); bytes written = 65536; size
=0
To further investigate I/O Errors, Symantec recommends contacting your hardware
vendor.
Known issues
External event has caused rewind
This issue is potentially serious and requires immediate investigation, as data can
be lost. NetBackup displays this error if the block position it has calculated
does not match the position reported by the drive. It is not certain
that a full rewind has occurred (this is impossible to tell from a simple block check), but
the position check has failed, and most likely the position reported by the drive
is less than the expected position.
The error will look similar to the following:
<2> io_terminate_tape: block position check: actual 4, expected 5
<16> write_data: FREEZING media id XXXXXX, External event caused rewind
during write, all data on media is lost
NetBackup keeps track of how much data it is sending to the operating system to
write to the device. NetBackup will ask the tape device for its position as an integrity
check after the end of each write. If this position does not match what NetBackup
has calculated the position should be, then the job will fail with a media write error.
If a full rewind has occurred, this will overwrite the NetBackup header on the tape,
making it unreadable. If this has happened, the data on the media is lost. The most
common cause of this is a SCSI reset on the SAN, which causes a rewind of the
drive(s) whilst they are being written to. This event is undetectable by NetBackup,
and is only discovered after the event, when the block position check is made.
NetBackup cannot cause SCSI resets on the SAN because the tape positioning and
read/write operations are all controlled by the Operating System itself.
If the issue is a position error (as opposed to a 'Full' rewind) a message similar to the
following will be seen upon inspection of the bptm log.
<2> write_data: block position check: actual 62504, expected 31254
<16> write_data: FREEZING media id XXXXXX, too many data blocks written, check
tape/driver block size configuration
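The two block position messages shown above differ in the direction of the mismatch. The following Python sketch extracts the actual and expected values from a bptm log line and reports that direction. The function name and the labels are hypothetical, and mapping "behind" to a possible rewind and "ahead" to too many blocks written is an assumption drawn only from the two examples in this tech note.

```python
import re

# Matches the "block position check" text shown in the bptm examples.
POSITION_RE = re.compile(
    r"block position check:\s*actual\s+(?P<actual>\d+),\s*"
    r"expected\s+(?P<expected>\d+)"
)


def classify_position_check(line):
    """Return (actual, expected, verdict) for a bptm line, or None.

    verdict is one of:
      "ok"     - drive position matches what NetBackup calculated
      "behind" - drive reports fewer blocks than expected
                 (the pattern seen with "External event caused rewind")
      "ahead"  - drive reports more blocks than expected
                 (the pattern seen with "too many data blocks written")
    """
    match = POSITION_RE.search(line)
    if match is None:
        return None
    actual = int(match.group("actual"))
    expected = int(match.group("expected"))
    if actual == expected:
        verdict = "ok"
    elif actual < expected:
        verdict = "behind"
    else:
        verdict = "ahead"
    return actual, expected, verdict


if __name__ == "__main__":
    lines = [
        "<2> io_terminate_tape: block position check: actual 4, expected 5",
        "<2> write_data: block position check: actual 62504, expected 31254",
    ]
    for line in lines:
        print(classify_position_check(line))
```

A script like this only summarises what the log already says. It cannot tell whether a full rewind occurred, for the same reason that NetBackup cannot.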
The possible causes are numerous, and most commonly include:
Tape driver issue
Tape drive firmware issue
SAN fault
HBA driver or firmware issue, or other fault
Switch Fault
If the drives are attached to an NDMP device, it must be ensured that the SCSI
reservation on the NDMP device is set to match the SCSI reservation type of
NetBackup.
To further investigate "External event has caused rewind" issues, Symantec
recommends contacting your hardware/operating system support vendor.
Note
The SCSI reservation is set/held by the Host Bus Adaptor. However, NetBackup
sends the reserve command through the SCSI pass-thru path for the device, so this
needs to be configured correctly.
Known Issues:
NDMP
If the issue is occurring on drives that are shared (SSO) between an NDMP filer and
NetBackup, and the drives are zoned directly to the filer, then the issue can be
caused if the SCSI reservation type set in NetBackup is not the same as the SCSI
reservation type set on the filer.
If this is the case, the issue can be resolved by following these steps:
1. In the 'Host Properties' > 'Media Type' tab in NetBackup, check which SCSI
reservation type is set: SPC2 or SCSI persistent.
2. Change the type of SCSI reservation on the filer to match the type set in
NetBackup.
3. Reboot the robotic library to break all current reservations.
The following TechNote has a detailed explanation of SCSI reservation:
http://www.symantec.com/docs/HOWTO32767
Scenario: BPTM block position check fails one block short using IBM atdd driver
6.0.0.96 on HP-UX 11.31 IA64
This issue is actually caused by the HP ATDD driver writing the EOT mark
incorrectly. However, Symantec has produced a NetBackup 7.0.1 EEB to
work around this issue (ETrack 2142743 / TECH155113).
Using the ATDD driver with NetBackup 7.0.1 and later on HP-UX 11.31 IA64 requires
atdd driver 6.0.2.8 or later. Upgrading to the new ATDD driver resolves the issue.
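When checking several hosts against this minimum driver level, the version strings should be compared numerically, component by component, rather than as text. A minimal Python sketch follows; the function names are hypothetical, and how the installed version string is obtained from the host is outside its scope.

```python
# Minimum ATDD driver level stated in this tech note.
MIN_ATDD_VERSION = "6.0.2.8"


def version_tuple(text):
    """Turn a dotted version such as '6.0.0.96' into (6, 0, 0, 96)."""
    return tuple(int(part) for part in text.strip().split("."))


def atdd_is_supported(installed, minimum=MIN_ATDD_VERSION):
    """True if the installed driver is at or above the minimum level."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    for version in ("6.0.0.96", "6.0.2.8", "6.0.2.10"):
        print(version, atdd_is_supported(version))
```

Comparing tuples of integers avoids the trap of plain string comparison, which would rank 6.0.2.10 below 6.0.2.8.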
Tapes being incorrectly marked as 'read only'
NetBackup has no understanding of 'read only'. This state is set by the tape drive,
usually by means of a small, physical switch on the tape cartridge.
Therefore, if a tape is being reported as 'read only', this cannot be the fault of
NetBackup.
'Read only' is reported by the firmware of the tape drive and logged by NetBackup
as a TapeAlert:
0x09: 'Cartridge write protected'
It has been seen on occasion that firmware issues of the tape drive have caused
tape media to be incorrectly reported as read only.
Library Inventory Issues
NetBackup does not directly 'inventory' a library. Instead, it queries the library and
waits to be told which tapes (via their barcodes) are located at which element addresses
(slots/drives). If, for example, NetBackup cannot 'see' a particular cartridge, it is
because the library is 'hiding' the location, not because of any setting within
NetBackup.
Common symptoms of library issues include tapes appearing in the
wrong slot, and tapes/slots not appearing at all. It is impossible for this to
be caused by NetBackup.
To further investigate Library issues, Symantec recommends contacting your
hardware vendor.
Note
Issues involving NetBackup and the Virtual I/O slots on the IBM 3500 series libraries
where ALMS/Virtual I/O are enabled are occasionally seen.
Problems involving Virtual I/O slots cannot be caused by NetBackup because there
are no settings in NetBackup that can influence the behavior of the Virtual I/O slots.
It has been found that the library setting "Queued Exports" should be set to 'HIDE'
from within the IBM web console to allow tapes to be moved from the virtual I/O slots
to the slots within the logical library.
Robot load issue - "Error bptm error requesting media TpErrno = Robot
operation failed"
This error is seen in the bptm log, and depending on the logging set, may be
referenced in the .../volmgr/debug log, and possibly also the operating system event
log.
An excellent way to check this is to use the robtest command. A link to
documentation on robtest is available at the end of this TechNote.
The robtest command does not issue any NetBackup commands. It only sends
operating system level SCSI commands to the library, and the output seen from the
command is sent from the library firmware. Given this description, it is clear to see
that Robtest failures cannot be caused by NetBackup.
For example, this robtest command issues a move media request from slot 86 to
drive 2:
m s86 d2
move_medium failed
sense key = 0x4, asc = 0x15, ascq = 0x1, MECHANICAL POSITIONING ERROR
As robtest has only sent a SCSI move request, it is immediately clear that this failure
is not caused by NetBackup.
Further, the error is referencing an 'ASC/ASCQ' error, which, as explained in the
"ASC/ASCQ" section of this tech note, is never caused by NetBackup.
To further investigate robotic operation issues, Symantec recommends involving the
Library's vendor.
Cleaning Tape
An unusual issue has been seen in NetBackup 7.5. On occasion, a cleaning cycle
run by NetBackup will fail.
The symptoms may differ slightly :
A. The tape cannot be unloaded, and the /var/adm/messages log will show:
Mar 14 12:49:38 server02 tldcd[19756]: [ID 559682 daemon.notice] TLD(2)
closing/unlocking robotic path
Mar 14 12:49:38 server02 tldcd[9536]: [ID 919746 daemon.notice] inquiry() function
processing library ADIC Scalar i2000 607A:
This issue is caused by the 'access bit' being set to 1 on the tape drive.
The issue is resolved with EEB 2714761.