
Nbu Faq 3


Running NetBackup 7.1 on Win2008R2 SP1.

I've got a policy that kicks off several (20+) backups of VMs. The parent job says Active, but all
the children say "Queued". The state of all of the children is "Drives are in use in storage unit."
Detailed status on each child job says "awaiting resource netbackup01-hcart3-robot-tld-1 - No drives
are available". I've looked in the tape drive's GUI, no tape is in the drive.

In NetBackup -> Media and Device Management -> Device Monitor, the tape drive shows as
Control=TLD, Ready=No.

MDS allocations in EMM:

MdsAllocation: allocationKey=12439 jobType=2 mediaKey=4000097 mediaId=0513L3 driveKey=0 driveName= drivePath= stuName= masterServerName=netbackup01 mediaServerName=netbackup01 ndmpTapeServerName= diskVolumeKey=0 mountKey=0 linkKey=0 fatPipeKey=0 scsiResType=0 serverStateFlags=0
MdsAllocation: allocationKey=12525 jobType=1 mediaKey=4000090 mediaId=00075L driveKey=2000005 driveName=IBM.ULTRIUM-HH4.001 drivePath={3,0,1,0} stuName=netbackup01-hcart-robot-tld-0 masterServerName=netbackup01 mediaServerName=netbackup01 ndmpTapeServerName= diskVolumeKey=0 mountKey=0 linkKey=0 fatPipeKey=0 scsiResType=1 serverStateFlags=1

One media allocation with no drive: mediaId=0513L3 driveKey=0 driveName=

Another media allocation with drive: mediaId=00075L driveKey=2000005 driveName=IBM.ULTRIUM-HH4.001

The problem is the first media allocation: a media allocation is expected to come with a drive
allocation as well, and this one has none.
Release the allocation for this media: nbrbutil -releaseMDS 12439
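As a rough sketch of the check described above, the following Python snippet scans MdsAllocation records for entries whose driveKey is 0 and prints the matching release command. It assumes the records are available as text with one record per line (for example, saved from nbrbutil -dump output); the function names are made up for illustration, and the sample is shortened from the listing above.

```python
import re

# Shortened copies of the two records shown above, one record per line.
SAMPLE = (
    "MdsAllocation: allocationKey=12439 jobType=2 mediaKey=4000097 "
    "mediaId=0513L3 driveKey=0 driveName= drivePath= stuName= "
    "masterServerName=netbackup01\n"
    "MdsAllocation: allocationKey=12525 jobType=1 mediaKey=4000090 "
    "mediaId=00075L driveKey=2000005 driveName=IBM.ULTRIUM-HH4.001 "
    "drivePath={3,0,1,0} stuName=netbackup01-hcart-robot-tld-0\n"
)


def parse_mds_allocations(dump_text):
    """Turn each 'MdsAllocation:' record into a dict of key=value fields."""
    records = []
    for chunk in dump_text.split("MdsAllocation:")[1:]:
        # Values may be empty (e.g. 'driveName='), so allow zero characters.
        records.append(dict(re.findall(r"(\w+)=(\S*)", chunk)))
    return records


def allocations_without_drive(records):
    """Media allocations that have no drive attached (driveKey=0)."""
    return [r for r in records if r.get("driveKey") == "0"]


def release_commands(records):
    """Build the release command for each allocation, keyed by allocationKey."""
    return ["nbrbutil -releaseMDS " + r["allocationKey"] for r in records]


stuck = allocations_without_drive(parse_mds_allocations(SAMPLE))
for command in release_commands(stuck):
    print(command)
```

The snippet only prints the commands; review them before running anything, since releasing an allocation that is really in use would disturb a running job.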

About the drive in RESTART mode:

The NBU Device Management service (ltid) might need to be restarted for the drive to become
available again.

Device troubleshooting starts at OS level. If you tell us which OS, it is easier to provide advice.

To get NBU to log Media Manager actions/errors to the OS logs, please add the following entry to
vm.conf on the master and/or media server where the DOWN drive is seen
(Windows: <install-path>\veritas\volmgr\vm.conf, Unix/Linux: /usr/openv/volmgr/vm.conf):

VERBOSE

Restart NBU. Try to UP the drive in Device Monitor.

There are LOTS of possible reasons for drives being down'ed.

No use speculating here without seeing /var/adm/messages after VERBOSE logging has been enabled.
Generally in NBU if the drive(s) have been working correctly and nothing has been changed, then
it is very unlikely NBU is the cause, the main reason being, NBU does not write to drives, it is all
done by the OS.

Just occasionally, I find that completely removing and reconfiguring the drive brings it back to
life - and by that I mean remove it from the OS and NBU completely, then put it back.

If that does not prod it back into life, then no amount of prodding, poking or tickling it under
the chin is likely to make a difference, and it is likely that it needs to go off to the 'tape drive
hospital' for some treatment.

If it goes DOWN again, the reason will be logged in syslog on a Unix/Linux server (e.g.
/var/adm/messages on Solaris, /var/log/messages on Linux) or in the Event Viewer Application log on
a Windows server. Device errors will be logged in the System log.

NBU has minimal contact with tape drives. The only thing it does is send a few SCSI commands, and
apart from some versions of Unix/Linux, even these go via the OS. Even then, these SCSI commands
are only used so NBU knows when the tape is in the drive. After that point, it's all OS
(NBU just passes the data to the OS, which then writes it onto tape).

It is for this reason, as pointed out by Marianne, that tape drive issue investigation should start
at the OS.

We can look in the file /usr/openv/netbackup/db/media/errors (or on Windows
<install>\veritas\netbackup\db\media\errors) and get some idea if there is any pattern to the errors
on the drive or media.

If you have access to Solaris, you can download tperr.sh and run the media/errors file against it,
full instructions and download here :
https://www-secure.symantec.com/connect/downloads/tperrsh-script-solaris-only
(the errors file is on each media server).

The system log should show some detail (follow Marianne's instructions). If you see anything that
mentions io_ioctl, ASC/ASCQ or TapeAlert, it is almost 100% certain you have a faulty drive.
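A minimal sketch of that check in Python: it filters log lines for the indicators named above. The sample lines are invented placeholders for illustration, not real NetBackup or OS messages, and the function name is hypothetical.

```python
import re

# Indicators named in the advice above; matching is case-insensitive.
INDICATORS = re.compile(r"io_ioctl|\bASCQ?\b|tapealert", re.IGNORECASE)


def suspect_lines(lines):
    """Return the lines that mention io_ioctl, ASC/ASCQ or TapeAlert."""
    return [line for line in lines if INDICATORS.search(line)]


# Placeholder lines, made up purely to exercise the filter.
EXAMPLE_LINES = [
    "placeholder: io_ioctl call failed on drive 0",
    "placeholder: sense data ASC = 0x15, ASCQ = 0x01",
    "placeholder: TapeAlert flag raised",
    "placeholder: ASCII label read from tape",
    "placeholder: mount request completed",
]

for line in suspect_lines(EXAMPLE_LINES):
    print(line)
```

In practice the input would be the lines of /var/adm/messages, /var/log/messages, or an exported Windows event log, read after VERBOSE logging has been enabled.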

It appears that the drive had a tape in it whose barcode it could not read. The system wasn't even
displaying the presence of the tape media... it showed one less than was actually in the library. I
removed the tape and the drive is now reading other tape media normally.

In general, each media server will have a single storage unit per library for all the drives of the
same density.

So in your case, assuming all drives are shared, each media server should have an hcart3 storage
unit with 8 drives and an hcart storage unit with 4 drives.
You only tend to have additional ones with fewer drives if you want to restrict a particular media
server or policy from using all of the drives, in which case you can take a copy of the storage unit
and reduce its drive count.

Do check all of your policies and schedules (and disk staging / SLPs if you use them) before
deleting any storage units, to make sure that they are not specified somewhere, which would cause
the policy to fail or fall back to "any available".

The following NetBackup drive statuses can appear in command line output or in the Device Monitor
in the NetBackup Administration Console.

Column: Description

Drive Name
Drive name assigned to a drive during configuration.

Control
Control mode for a drive. It can be any of the following.

<robot_designation>
For example, TLD.
The robotic daemon managing the drive has connected to LTID (the device daemon and Device
Manager service) and is running. The drive is in the usable state. AVR is assumed to be active for
the drive, as all robotic drives must be in automatic volume recognition (AVR) mode (not OPR
mode).
Applies only to robotic drives.

DOWN-<robot_designation>
For example, DOWN-TLD.
The drive is in an unusable state because it was downed by an operator or by NetBackup; or when
the drive was configured, it was added as a down drive. Applies only to robotic drives.

DOWN
In this mode, the drive is not available to Media Manager.
Applies only to standalone drives.
A drive can be in a DOWN mode because of problems or because it was set to that mode using
Actions | Down Drive.

PEND-<robot_designation>
For example, PEND-TLD.
The drive is in a pending status. Applies only to robotic drives.

PEND
The drive is in a pending status. Applies only to standalone drives.
If the drive reports a SCSI RESERVATION CONFLICT status, this column will show PEND. This
status means that the drive is reserved when it should not be reserved. Some server operating
systems (Windows, Tru64, and HP-UX) may report PEND if the drive reports Busy when opened.
You can use the AVRD_PEND_DELAY entry in the Media Manager configuration file to filter out
these reports.

AVR
The drive is running with automatic volume recognition enabled.
The drive is in a usable state with automatic volume recognition enabled, but the robotic daemon
managing the drive is not connected or is not working. Automated media mounts do not occur
with a drive in this state (unless the media is in a drive on the system, or, this is a standalone tape
drive), but the operator can physically mount a tape in the drive or use robtest to cause a tape
mount as needed.

OPR
The drive is running in a secure mode where operators must manually assign mount requests to the
drive. AVRD is not scanning this drive when in this mode. This mode gives operators control over
which mount requests can be satisfied.
Applies only to standalone drives.

NO-SCAN
A drive is configured for shared storage option (SSO), but has no available scan host (to be
considered available, a host must register with a SSO_SCAN_ABILITY factor of non-zero and
have the drive in the UP state). NO-SCAN may be caused if all available scan hosts have the drive
in the DOWN state. Other hosts (that are not scan hosts) may want to use the drive, but they
registered with a scan factor of zero. The drive is unusable by NetBackup until a scan host can be
assigned.

Mixed
The control mode for a shared drive may not be the same on all hosts sharing the drive. For shared
drives, each host can have a different status for the drive. If the control modes are all the same,
that mode is displayed.

RESTART
The control mode for a shared drive may not be the same on all hosts sharing the drive. This status
indicates that ltid needs to be restarted. To determine which servers need to be restarted,
right-click the drive in the Device Monitor and select Up. This tells you on which servers ltid
needs to be restarted.
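For quick triage, the control modes listed above can be folded into a small lookup. This is only an illustrative Python sketch: the function name and the wording of each summary are my own condensation of the descriptions above, not NetBackup output.

```python
def describe_control(control):
    """Return a one-line summary for a Device Monitor Control value."""
    value = control.strip().upper()
    # DOWN and PEND may carry a robot designation suffix, e.g. DOWN-TLD.
    if value.startswith("DOWN"):
        return "down: unusable until an operator or NetBackup brings it up"
    if value.startswith("PEND"):
        return "pending: check for a SCSI reservation conflict or a busy drive"
    fixed = {
        "AVR": "AVR only: robotic daemon not connected, no automated mounts",
        "OPR": "operator mode: mount requests must be assigned manually",
        "NO-SCAN": "no scan host: shared drive unusable until one is assigned",
        "MIXED": "mixed: hosts sharing the drive report different modes",
        "RESTART": "restart: ltid needs to be restarted on one or more hosts",
    }
    if value in fixed:
        return fixed[value]
    # Anything else is treated as a robot designation such as TLD.
    return "robotic control (%s): robotic daemon connected, drive usable" % value


for mode in ("TLD", "DOWN-TLD", "PEND-TLD", "AVR", "RESTART"):
    print(mode, "->", describe_control(mode))
```

Note that a drive can show a robot designation such as TLD in Control and still be Ready=No, as in the question at the top, so the Control value alone does not prove the drive is available.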

Additional information for the columns that can appear in the Administration console can be
found in the NetBackup 6.0 Media Manager Guides and NetBackup 6.5 Troubleshooting Guide.
These documents can be found linked below in the Related Documents section.

1. AVR (sgscan, job monitor, mounting)

AVR mode means that NetBackup cannot talk to the robotics of the library. It could mean that
"tldcd" is not running or the SCSI pass-through connection to the robot is not working. First, look
in the system logs to see if there is anything obvious there. Second, create the
/usr/openv/volmgr/debug/daemon and /usr/openv/volmgr/debug/tpreq directories on the media servers
(or the equivalent folders on Windows boxes). If you find something obvious, then you can fix it. If
not, send the relevant information back to this forum and we will try to help.

2. active (label, media error, jnbSA)
3. Status code 800: NetBackup 6.0 jobs are failing with Status 800 (resource request failed).

1.
netbackup stop
netbackup start
bpps -x # NetBackup processes: tldd, tldcd, acsssi, acsd, acssel, nbjm

2. vmd log in /usr/openv/volmgr/debug/daemon/:
14:34:45.043 [524712] <4> vmd: INITIATING # vmd
14:34:45.052 [524712] <2> mm_getnodename: cached_hostname NetBackup_Master, cached_method 3
14:34:45.098 [524712] <2> mm_getnodename: (3) hostname NetBackup_Master (from mm_master_config.mm_server_name)
14:34:45.098 [524712] <4> vmd: Host name is NetBackup_Master
14:34:45.265 [524712] <2> EndpointSelector::EndpointSelector: set RI to: NetBackup_Master(Endpoint_Selector.cpp:166) # PBX
14:34:45.304 [524712] <2> Orb::init: initializing ORB Default_CLIENT_Orb with: vmd -ORBSvcConfDirective -ORBDottedDecimalAddresses 0 -ORBSvcConfDirective static PBXIOP_Factory -enable_keepalive' -ORBSvcConfDirective static EndpointSelectorFactory -ORBSvcConfDirective static Resource_Factory
-
14:35:35.332 [524712] <16> emmlib_initializeEx: (-) Exception! CORBA::TRANSIENT
14:35:35.333 [524712] <16> emmlib_initializeEx: (-) system exception, ID IDL:omg.org/CORBA/TRANSIENT:1.0
OMG minor code (2), described as *unknown description*, completed = NO # nbemm; check with /usr/openv/db/bin/nbdb_ping
14:35:40.333 [524712] <16> EmmInit: (-) Translating EMM_ERROR_CorbaTransient(3000001) to 195 in the Media context
14:35:40.346 [524712] <16> vmd: EMM interface initialization failed, status = 195 # EMM
14:35:40.347 [524712] <2> da_startup_recovery_complete: vmd/DA entering normal operating mode

PBX (unified log):
11/23/10 14:34:58.881 [EndpointSelector::EndpointSelector] set RI to: NetBackup_Master(Endpoint_Selector.cpp:166)
11/23/10 14:34:58.900 [Orb::init] endpointvalue is : pbxiop://1556:nbsvcmon(Orb.cpp:609)
11/23/10 14:34:58.901 [Orb::init] initializing ORB nbsvcmon with: nbsvcmon -ORBSvcConfDirective -ORBDottedDecimalAddresses 0 -ORBSvcConfDirective static PBXIOP_Factory -enable_keepalive' -ORBSvcConfDirective static Resource_Factory -ORBConnectionCacheMax 1024 -ORBEndpoint pbxiop://1556:nbsvcmon -ORBSvcConf /dev/null -ORBSvcConfDirective static Server_Strategy_Factory -ORBMaxRecvGIOPPayloadSize 268435456(Orb.cpp:720)
11/23/10 14:35:57.877 [Critical] V-137-6 Failed to initialize ORB: check to see if PBX is running or if service has permissions to connect to PBX. Check PBX logs for details
11/23/10 14:35:57.877 V-137-103 [Orb::init] CORBA exception: system exception, ID IDL:omg.org/CORBA/BAD_PARAM:1.0
TAO exception, minor code = 1 (endpoint initialization failure in Acceptor Registry; ETIMEOUT), completed = NO
during orb activation
11/23/10 14:35:57.880 [Critical] V-118-49 failed to initialize nbrb service
11/23/10 14:35:57.892 [main] CORBA exception: BAD_PARAM (IDL:omg.org/CORBA/BAD_PARAM:1.0) system exception, ID IDL:omg.org/CORBA/BAD_PARAM:1.0
TAO exception, minor code = 1 (endpoint initialization failure in Acceptor Registry; ETIMEOUT), completed = NO
# The unified log shows NetBackup failing to connect to PBX (BAD_PARAM).
3. Restart PBX:
/opt/VRTSpbx/bin/vxpbx_exchanged restart

Drive statuses seen on a media server: DOWN, PEND, OPR, MIXED, RESTART

DOWN-<robot_designation>
For example, DOWN-TLD.
The drive is in an unusable state because it was downed by an operator or by NetBackup; or when
the drive was configured, it was added as a down drive. Applies only to robotic drives.

On the media server: vmoprcmd -up <drive_index>

PEND-<robot_designation>
For example, PEND-TLD.
The drive is in a pending status. Applies only to robotic drives.

OPR
The drive is running in a secure mode where operators must manually assign mount requests to the
drive. AVRD is not scanning this drive when in this mode. This mode gives operators control over
which mount requests can be satisfied.
Applies only to standalone drives.

vmoprcmd -down <index_number>, vmoprcmd -up <index_number>

Mixed
The control mode for a shared drive may not be the same on all hosts sharing the drive. For shared
drives, each host can have a different status for the drive. If the control modes are all the same,
that mode is displayed.


RESTART
The control mode for a shared drive may not be the same on all hosts sharing the drive. This status
indicates that ltid needs to be restarted. To determine which servers need to be restarted,
right-click the drive in the Device Monitor and select Up. This tells you on which servers ltid
needs to be restarted.


Pending-TLD
If requests await action or if NetBackup acts on a request, the Pending Requests pane appears. For
example, if a tape mount requires a specific volume, the request appears in the Pending Requests
pane. If NetBackup requires a specific volume for a restore operation, NetBackup loads or
requests the volume. After all requests are resolved (automatically by NetBackup or manually by
operator intervention), the Pending Requests pane disappears.

Resubmitting a request
Denying a request

1. Status 23: socket read failed.

2. Check the master, media, and client hosts.

3. /usr/openv/netbackup/bin/bpclntcmd
3.1. bpclntcmd -self # how this host identifies itself to NetBackup
[root@rh5 ~]# hostname
rh5
[root@rh5 ~]# bpclntcmd -self
yp_get_default_domain failed: (12) Local domain name not set
NIS does not seem to be running: (1) Request arguments bad
gethostname() returned: rh5
host rh5: rh5 at 192.168.11.4 (0x40ba8c0)
aliases:
127.0.0.0.1

3.2. bpclntcmd -hn <media/master hostname> # look up the master or media server by hostname
[root@rh5 ~]# bpclntcmd -hn rh5
host rh5: rh5 at 192.168.11.4 (0x40ba8c0)
aliases:

3.3. bpclntcmd -ip <media/master IP> # look up the master or media server by IP address
[root@rh5 ~]# bpclntcmd -ip 192.168.11.4
checkhaddr: host : rh5: rh5 at 192.168.11.4 (0x40ba8c0)
checkhaddr: aliases:

3.4. bpclntcmd -pn # ask the master server how it sees this host
[root@rh5 ~]# bpclntcmd -pn
expecting response from server rh5
rh5 rh5 192.168.11.4 37865
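A side note on the bpclntcmd output above: the value in parentheses, 0x40ba8c0, looks like the same IPv4 address stored as a 32-bit integer in little-endian byte order. That reading is an observation from this particular output (an x86 Linux host), not something the source states, so treat the sketch below as a hedged illustration. The function name is made up.

```python
import socket
import struct


def decode_hex_address(hex_text):
    """Decode a hex value such as '0x40ba8c0' as a little-endian IPv4 address."""
    value = int(hex_text, 16)
    # Pack as a little-endian unsigned 32-bit integer, then read the four
    # bytes in order as the dotted-quad address.
    return socket.inet_ntoa(struct.pack("<I", value))


print(decode_hex_address("0x40ba8c0"))  # 192.168.11.4
```

If the decoded value ever disagrees with the dotted address printed next to it, that would be a hint to look at name resolution on the host.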

4. Repeat the bpclntcmd checks on the media server.

4.1. telnet <client name> 13782
[root@rh5 ~]# telnet rh5 13782
Trying 192.168.11.4
Connected to rh5 (192.168.11.4).
Escape character is ^].
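Where telnet is not installed, the same reachability test can be sketched in Python. The function name is hypothetical; the ports in the comment are the ones that appear in this document (13782 in the telnet test, 13724 in the bptestbpcd output, 1556 in the PBX endpoint).

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage (not executed here, since it needs the real hosts):
#   port_open("rh5", 13782)   # same test as: telnet rh5 13782
#   port_open("rh5", 13724)
#   port_open("rh5", 1556)
```

Like telnet, this only proves that something is listening on the port. It says nothing about whether the hostname comparison inside bpcd will succeed, which is what bptestbpcd checks.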

Run the telnet test from the media server to the client.

4.2. /usr/openv/netbackup/bin/admincmd/bptestbpcd
bptestbpcd -client <client hostname> # test the connection from the media server to the client
[root@rh5 ~]# bptestbpcd -client rh5 -verbose
111
192.168.11.4:53709 -> 192.168.11.4:13724
192.168.11.4:53376 -> 192.168.11.4:13724
PEER_NAME = rh5
HOST_NAME = rh5
CLIENT_NAME = rh5
VERSION = 0x06500000
PLATFORM = linuxR_x86_2.6
192.168.11.4:34809 -> 192.168.11.4:13724

5. Check the bpcd log under /usr/openv/netbackup/logs/bpcd on the client after running
bptestbpcd -client <client hostname> from the media server.
[root@rh5 ~]# bptestbpcd -client rh5 -verbose
06:39:23.270 [17401] <2> bpcd main: offset to GMT -28800
06:39:23.270 [17401] <2> logconnections: BPCD ACCEPT FROM 192.168.11.4.41263 TO 192.168.11.4.13724
06:39:23.270 [17401] <2> bpcd main: setup_sockopts complete
06:39:23.273 [17401] <2> bpcd peer_hostname: Connection from host rh5 (192.168.11.4) port 41263
06:39:23.273 [17401] <2> bpcd valid_server: comparing rh5 and rh5
06:39:23.274 [17401] <4> bpcd valid_server: hostname comparison succeeded
06:39:23.274 [17401] <2> bpcd main: output socket port number = 1
06:39:23.362 [17401] <2> bpcd main: Duplicated vnetd socket on stderr
06:39:23.362 [17401] <2> bpcd main: <- NetBackup 6.5 0 initiated
06:39:23.362 [17401] <2> bpcd main: VERBOSE = 0
06:39:23.362 [17401] <2> bpcd main: Not using VxSS authentication with rh5
06:39:23.403 [17401] <2> bpcd main: BPCD_GET_STDOUT_SOCKET_RQST
06:39:23.403 [17401] <2> bpcd main: socket port number = 1
06:39:23.490 [17401] <2> bpcd main: Connected on output socket
06:39:23.491 [17401] <2> bpcd main: Skipping shutdown of send side of stdout.
06:39:23.491 [17401] <2> bpcd main: Duplicated socket on stdout
06:39:23.492 [17401] <2> bpcd main: BPCD_DISCONNECT_RQST
06:39:23.492 [17401] <2> bpcd exit_bpcd: exit status 0 >exiting
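The legacy log lines shown in this document (vmd and bpcd) share one layout: time, [pid], <severity>, then text. A small Python sketch for pulling out the higher-severity lines follows. Treating <16> and above as errors is an assumption based on the vmd excerpt earlier, where the failing lines carry <16>; the function names are made up, and the sample lines are copied from the logs above.

```python
import re

LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) \[(?P<pid>\d+)\] <(?P<sev>\d+)> (?P<text>.*)$"
)

SAMPLE = """\
06:39:23.274 [17401] <4> bpcd valid_server: hostname comparison succeeded
06:39:23.492 [17401] <2> bpcd exit_bpcd: exit status 0 >exiting
14:35:40.346 [524712] <16> vmd: EMM interface initialization failed, status = 195
"""


def parse_line(line):
    """Split one legacy log line into its fields, or return None."""
    match = LINE.match(line.strip())
    if match is None:
        return None  # continuation lines and blanks do not match
    return {
        "time": match["time"],
        "pid": int(match["pid"]),
        "severity": int(match["sev"]),
        "text": match["text"],
    }


def errors(lines, threshold=16):
    """Return parsed entries whose severity is at or above the threshold."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["severity"] >= threshold]


for entry in errors(SAMPLE.splitlines()):
    print(entry["time"], entry["text"])
```

Lines that wrap onto a second line in the real log are skipped by this parser, so the full text of a long message may need to be read from the file directly.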

Active jobs that cannot be canceled in the NetBackup Activity Monitor

1. Identify the job id of the hung job that will not cancel.

2. The files for that job id are under netbackup\db\jobs: bpjobd.act.db, and the restart, trylogs,
and ffilelogs entries for the job id.

3. Restart NetBackup.

I am trying to decommission a media server, for which I used nbdecommission command but
failed. Later I tried removing the host manually.
I do not see any STU/STUG/SLP/media/drives/robots assigned/configured to this media server.
Master server is at 7.5.0.3
Media server is at 7.1.0.4
# vmglob -listall -b | grep -i mediaserver
#
# bpmedialist -mlist -h mediaserver
#
# nbemmcmd -deletehost -machinename mediaserver -machinetype media
NBEMMCMD, Version: 7.5.0.3
The function returned the following failure status:
the media is allocated for use (199)
Command did not complete successfully.

Finally I was able to delete the media server from the configuration. Symantec support suggested
increasing the logging and asked me to retry the nbdecommission command to capture the logs.
But fortunately/unfortunately, this time the nbdecommission command exited successfully, so we
couldn't find the root cause. Following are the steps suggested by the Symantec engineer.
1) Verify that the following log directories are in place, and create them if they do not exist:
/usr/openv/netbackup/logs/admin
/usr/openv/logs/nbemm
2) Open a PuTTY session to the master server, cd to /usr/openv/netbackup/bin, and enter this
command: vxlogcfg -a --prodid 51216 -o all -s DebugLevel=6 -s DiagnosticLevel=6
3) Ensure that legacy logging is turned up on the master server:
- Open the NetBackup Administration GUI.
- Go to the host properties of the master\media server.
- Select the logging tab.
- Uncheck robust logging and set the global logging value to 5.
- Stop\start the NetBackup daemons/services.
4) At this point, try to delete the media server from EMM or use the nbdecommission utility to
decommission the media server.

No media are currently assigned to this media server. The respective output is already posted.

Still unable to remove it from EMM. By the way, I used the nbdecommission command to decommission
this media server. It did all the work but failed to delete the entry from EMM.

Make sure you followed the steps below.

You need to move the assigned tapes to another media server before you delete the media server.
Condition: both the new and the old media servers should have the same robot control and drives.
Run the commands below in order:
bpmedia -movedb -allvolumes -oldserver <old_server> -newserver <new_server>
nbemmcmd -deletealldevices -machinename <old_server> -machinetype media
nbemmcmd -deletehost -machinename <old_server> -machinetype media
Then remove the server from bp.conf on the master, or use the media server decommission tool.

http://www.symantec.com/business/support/index?page=content&id=TECH130438

See if there are images assigned to this media server:

bpimagelist -media -idonly -server <media-server> -s 1/1/2005

Check resource allocation as well:

nbrbutil -dump | grep <mediaservername>
If it shows any allocations, use nbrbutil -resetmediaserver <mediaservername>, then try removing.

I suggest a clean restart of the NetBackup master server, and then try removing the media server.

Make sure all the allocations are released before stopping NetBackup:
/usr/openv/netbackup/bin/admincmd/nbrbutil -resetall
/usr/openv/netbackup/bin/admincmd/nbrbutil -dump | wc -l ---> it should return 9; if not, run the
reset command again until you get 9.
Then stop NetBackup and start it again.

Problem
Problem occurs when allocation of media or tape drive is present in NBRB and other jobs are
getting affected.

Cause
Master Server:
1. nbrbutil -resetMediaServer "media_server name"
2. nbrbutil -dump This shows the media in question is still reserved in the drive.
3. nbrbutil -releaseMDS <MediaKey> (Shows no error)
4. nbrbutil -dump (This shows the media is still reserved.)

Solution
To manually remove the files associated with the media and drive, from the corresponding media
server perform the following:
1. cd /usr/openv/netbackup/db/media
2. Remove <db.lockfile>
3. cd /usr/openv/netbackup/db/media/drives
4. Remove the drive which reserved the Media
5. cd /usr/openv/netbackup/db/media/tpreq
6. Remove the drive in question.
7. Stop NetBackup services.
8. cd /usr/openv/volmgr/misc
9. Remove all lock files.
10. Start NetBackup services.

Applies To
1. Verify the media is not in the drive using /usr/openv/volmgr/bin/robtest.
2. Cancel all the jobs running on Mediaserver.
