NetBackup 7.5 Best Practice - Using Storage Lifecycle Policies
Managing backups, snapshots, duplication and replication including Auto Image Replication and Replication Director using Storage Lifecycle Policies
This paper describes the best practices for using Storage Lifecycle Policies, including the Auto Image Replication feature, in NetBackup 7.5 and later versions, and supersedes the previous best practice papers TECH153154 and TECH75047. If you have any feedback or questions about this document, please email them to IMG-TPM-Requests@symantec.com, stating the document title.
This document is provided for informational purposes only. All warranties relating to the information in this document, either express or implied, are disclaimed to the maximum extent allowed by law. The information in this document is subject to change without notice.
Table of Contents

Changes to Storage Lifecycle Policies in NetBackup 7.5 ..................... 1
NetBackup Duplication Best Practices ....................................... 1
    Plan for duplication time .............................................. 1
    Use OpenStorage devices rather than VTLs ............................... 1
    Use Maximum I/O streams per volume with Disk Pools ..................... 2
    Be conservative when using storage unit groups with Media Server
        Load Balancing ..................................................... 2
    Minimize contention for tape drives ................................... 10
    Provide an appropriate number of virtual tape drives for duplication .. 10
    Avoid sharing virtual tape drives ..................................... 11
    Preserve multiplexing ................................................. 11
    Use Inline Copy to make multiple copies from a tape device
        (including VTL) ................................................... 11
    Improving duplication speed of small images to tape ................... 11
Considerations for using Replication Director ............................. 14
The LIFECYCLE_PARAMETERS file ............................................. 15
Reporting on Storage Lifecycle Policies ................................... 17
    SLP Status Report ..................................................... 17
    Reporting on Auto Image Replication Activity .......................... 18
    The SLP Backlog Report ................................................ 19
Be conservative when using storage unit groups with Media Server Load Balancing
Using the Media Server Load Balancing option on storage unit groups can negatively affect Resource Broker performance. The more storage units and media servers a storage unit group represents, the harder the Resource Broker must work to pair the best source devices with the best destination devices. As long as a job sits in the queued state in the Activity Monitor waiting for resources, the Resource Broker repeats this search for that job on every pass through its work list, and it continues to do so for each such job until resources are found. Be conservative with storage unit groups by limiting the number of storage units and media servers they represent. If you add a media server or storage unit to a storage unit group, pay attention to how it affects performance.
Consider factors such as whether additional hardware is needed for the increased duplication load and whether or not duplication is necessary in each case. For example, if you do not currently duplicate a set of backups using Vault, perhaps they do not need to be duplicated with SLPs. Always consider the additional stress that an increase in duplication may place on your environment.
- Under normal operations, how soon do backups need to be fully duplicated? What are your Service Level Agreements? Determine a metric that works for the environment.
- Is the duplication environment (including hardware, networks, servers, I/O bandwidth, and so on) capable of meeting your business requirements?
- If the SLPs are configured to use duplication to make more than one copy, do the throughput estimates and resource planning account for all of those duplications?
- Do you have enough backup storage and duplication bandwidth to allow for downtime in your environment if there are problems?
- Have you planned for the additional time it will take to recover if a backlog situation does occur? After making changes to address the backlog, additional time will be needed for duplications to catch up with backups.
Delaying duplications is a good way of clearing a backlog caused by a temporary lack of resources (for example, if a tape library is down), stopping work from being queued while new resources are being installed, and allowing more urgent duplications to be processed ahead of older, less urgent ones. It does not solve the problem of a continuously growing backlog caused by having more duplications than the available resources can manage. For a continuously growing backlog, the best solution is to cancel the older duplications and decrease the amount of duplication work by modifying either the backup policies or the SLPs. As with SLP monitoring, there are two commands that can be used here:

nbstlutil inactive - Using different qualifiers, this command can be used to delay pending duplications by suspending processing for a particular SLP (-lifecycle), storage operation (-destination) or image (-backupid). Once this command is issued, no further duplication work is queued for the SLP, storage operation or image until the corresponding nbstlutil active command is issued. Setting an inactive state simply delays the processing and does not resolve backlog issues directly. This command needs to be used in conjunction with other actions (increasing duplication resources, reducing the amount of duplication activity or canceling other tasks in the backlog) to resolve the backlog.

Note: Simply setting an SLP or storage operation to inactive will not stop duplication requests from being added to the pending list, and once it is set to active again these requests will be processed and queued. This may result in a further backlog being created if there are insufficient resources to process the requests that built up while the SLP or storage destination was inactive. This command should only be used as part of a broader strategy to address backlog, either by increasing the duplication resources available (for example, adding more devices) or by reducing the duplication workload (for example, decreasing the number of copies an SLP creates).

nbstlutil cancel - Using different qualifiers, this command can be used to cancel pending duplications for a particular SLP (-lifecycle), storage destination (-destination) or image (-backupid). Note that using this command means that the pending duplication jobs will never be processed but will be discarded instead. Canceling the processing reduces the backlog quickly, but it may not be the best option in your environment.

Note: If you choose to cancel processing, be aware that NetBackup regards a cancellation as a successful duplication. Source images that are set to "expire on duplication" will be expired when all the required duplication operations are successful (that is, have completed successfully or been canceled). Canceling a duplication operation does not extend the retention period of the source copy to reflect that of the canceled target copy, nor does it enable the source to be duplicated to a target further down the hierarchy. By canceling a planned duplication you may shorten the retention period of the backup.

If you plan to reduce the backlog by working on it at the individual backup level, you should script the process, using the nbstlutil stlilist -image_incomplete -U command to identify the backup IDs of entries and passing those backup IDs to the nbstlutil cancel, inactive and active commands (see the sketch below). The nbstlutil commands only tell the Storage Lifecycle Manager service (nbstserv) to stop processing the images (to stop creating new duplication jobs for the images).
The nbstlutil command does not affect duplication jobs that are already active or queued. Consequently, you may need to cancel queued and active duplication jobs as well to release resources.
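The following is a minimal sketch of such a script, assuming a UNIX master server. The "Backup ID:" field label in the awk pattern is an assumption; adjust it to match the actual output of nbstlutil stlilist -image_incomplete -U in your environment.

    #!/bin/sh
    # List SLP-incomplete images, extract their backup IDs, and cancel the
    # pending SLP operations for each one.
    # CAUTION: NetBackup treats canceled duplications as successful, so
    # review the list of backup IDs before running this in production.
    nbstlutil stlilist -image_incomplete -U |
        awk '/Backup ID:/ {print $NF}' |      # field label is an assumption
        while read backupid; do
            nbstlutil cancel -backupid "$backupid"
        done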
As long as a job sits in the queued state in the Activity Monitor waiting for resources, the search will be run for every such job on each pass through the Resource Broker work list. If there are many jobs in the work list that need to find such resources, the performance of the Resource Broker suffers. Be conservative with storage unit groups by limiting their use on destinations that are candidates for Inline Copy within a single SLP. If you must use them this way, limit the number of storage units and the number of media servers represented by the storage unit group. If you add a media server or a storage unit, pay attention to how it affects Resource Broker performance. This is a best practice for all backup and duplication operations and is not specific to SLPs.
In releases prior to NetBackup 6.5.4, NetBackup did not automatically attempt to use the same media server for both the source and the duplication destinations. To ensure that the same media server was used, the administrator had to explicitly target the storage units that were configured to use specific media servers. In NetBackup 6.5.4 and later, this behavior can be controlled by using the following command:

nbemmcmd -changesetting -common_server_for_dup <default|preferred|required> -machinename <master_server_name>

Select from the following options:

default - The default option instructs NetBackup to try to match the destination media server with the source media server. Using the default option, NetBackup does not perform an exhaustive search for the source image. If the media server is busy or unavailable, NetBackup uses a different media server.

preferred - The preferred option instructs NetBackup to search all matching media server selections for the source. The difference between the preferred setting and the default setting is most evident when the source can be read from multiple media servers, as with SharedDisk. Each media server is examined for the source image, and for each media server, NetBackup attempts to find available storage units for the destination on the same media server. If all of the storage units that match the media server are busy, NetBackup attempts to select storage units on a different media server.

required - The required option instructs NetBackup to search all media server selections for the matching source. Similar to the preferred setting, NetBackup never selects a non-common media server if there is a chance of obtaining a common media server. For example, if the storage units on a common media server are busy, NetBackup waits when the required setting is in effect. Only when no common media server is available does NetBackup, rather than fail, allocate the source and destination on different media servers.
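For example, to ask NetBackup to prefer (but not require) a common media server for duplications, you might run the following on the master server (the host name here is hypothetical):

    nbemmcmd -changesetting -common_server_for_dup preferred -machinename master1.example.com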
Considerations for environments with very large images and many very small images
Though a large MIN_GB_SIZE_PER_DUPLICATION_JOB works well for large images, it can be a problem if you also have many very small images. It may take a long time to accumulate enough small images to reach a large MIN_GB_SIZE_PER_DUPLICATION_JOB. To mitigate this effect, you can keep the timeout that is set in MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB small enough for the small images to be duplicated sufficiently soon. However, the small timeout (to accommodate your small images) may negate the large MIN_GB_SIZE_PER_DUPLICATION_JOB setting (which was intended to accommodate your large images). It may not be possible to tune this perfectly if you have lots of large images and lots of very small images.

For instance, suppose you are multiplexing your large backups to tape (or virtual tape) and have decided that you would like to put 2 TB of backup data into each duplication job to allow Preserve multiplexing to optimize the reading of a set of images. Suppose it takes approximately 6 hours for one tape drive to write 2 TB of multiplexed backup data. To do this, you set MIN_GB_SIZE_PER_DUPLICATION_JOB to 2000. Suppose that you also have many very small images arriving throughout the day, such as backups of database transaction logs, which need to be duplicated within 30 minutes. If you set MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB to 30 minutes, you will accommodate your small images. However, this timeout means that you will not be able to accumulate 2 TB of the large backup images in each duplication job, so the duplication of your large backups may not be optimized.
(You may experience more re-reading of the source tapes due to the effects of duplicating smaller subsets of a long stream of large multiplexed backups.) If you have both large and very small images, consider the tradeoffs and choose a reasonable compromise with these settings. Remember: if you choose to allow a short timeout so as to duplicate small images sooner, then the duplication of your large multiplexed images may be less efficient.

Recommendation: Experience has shown that DUPLICATION_SESSION_INTERVAL_MINUTES tends to perform well at 15 to 30 minutes. Set MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB to twice the value of DUPLICATION_SESSION_INTERVAL_MINUTES. A good place to start would be to try one of the following, then modify these as you see fit for your environment:

DUPLICATION_SESSION_INTERVAL_MINUTES = 15
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 30

Or:

DUPLICATION_SESSION_INTERVAL_MINUTES = 30
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 60
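As an illustrative sketch, the 2 TB worked example above might translate into a LIFECYCLE_PARAMETERS file like the following. (The file lives in the NetBackup db/config directory on the master server, typically /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS on UNIX; verify the path for your installation.)

    MIN_GB_SIZE_PER_DUPLICATION_JOB = 2000
    DUPLICATION_SESSION_INTERVAL_MINUTES = 15
    MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 30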
NetBackup treats a virtual tape library (VTL) the same way that it treats a physical tape library; to NetBackup there is no difference between physical and virtual tape. One common practice with VTLs is VTL staging, in which backup images are written to a VTL with a short retention period and subsequently duplicated to some other storage device, usually physical tape, for longer-term storage. When using VTL staging it is important to remember that duplication is effectively a serial process. While duplication between virtual and physical tapes can preserve multiplexing, it cannot introduce it, so it is not possible to configure an environment where a large number of virtual tape drives duplicate to a smaller number of physical tape drives.
Reserve some tape drives specifically for backup jobs and some specifically for duplication jobs. The following guidance applies if you are not using the Shared Storage Option:

o Tape drives for the backup jobs: If you need duplication jobs to run concurrently with backup jobs, define a storage unit for your backup policies that uses only a subset of the tape drives in the device. To do this, ensure that the storage unit's setting for Maximum concurrent drives is less than the total number of drives in the library. For instance, if your tape device has 25 drives you may want to set Maximum concurrent drives for your backup storage unit to 12, thereby reserving the other 13 drives for duplication (or restore) jobs. (Suppose these 13 drives are for 12 duplication jobs to read images, and for 1 restore job to read images.)

o Tape drives for duplication jobs: Define a destination storage unit for your duplication jobs that matches, in number, the number of read drives reserved for duplication at the original device. To follow the previous example, this duplication storage unit would have a Maximum concurrent drives of 12, to match the 12 read drives reserved for duplication on the previous device. For the duplication jobs it is important to keep a 1:1 ratio of read virtual drives to write physical drives.

Keep in mind that media contention can still make this inefficient. To reduce media contention, be sure to follow the other guidelines in this document, including tuning the SLP environment for very large duplication jobs (via the LIFECYCLE_PARAMETERS file).
If more drive pairs are available for duplication, the duplication can be done in less time than the original backups. (This is subject to the performance of the hardware involved and SLP duplication batching.)
Preserve multiplexing
If backups are multiplexed on tape, Symantec strongly recommends that the Preserve multiplexing setting is enabled in the Change Destination window for the subsequent duplication destinations. Preserve multiplexing allows the duplication job to read multiple images from the tape while making only one pass over the tape. (Without Preserve multiplexing the duplication job must read the tape multiple times so as to copy each image individually.) Preserve multiplexing significantly improves performance of duplication jobs. On a VTL, the impact of multiplexed images on restore performance is negligible relative to the gains in duplication performance.
Use Inline Copy to make multiple copies from a tape device (including VTL)
If you want to make more than one copy from a tape device, use Inline Copy. Inline Copy allows a single duplication job to read the source image once, and then to make multiple subsequent copies. In this way, one duplication job reads the source image, rather than requiring multiple duplication jobs to read the same source image, each making a single copy one at a time. By causing one duplication job to create multiple copies, you reduce the number of times the media needs to be read. This can cut the contention for the media in half, or better.
The feature can be completely disabled by entering a value of 1 in the file DEFERRED_IMAGE_LIMIT. In most cases the default values should be sufficient to significantly increase the speed of duplications of small images.
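As a sketch, assuming a UNIX master server and that the DEFERRED_IMAGE_LIMIT file lives in the db/config directory (verify the location for your installation), the feature could be disabled as follows:

    echo 1 > /usr/openv/netbackup/db/config/DEFERRED_IMAGE_LIMIT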
All of the best practice considerations for SLPs also apply when using Auto Image Replication, but there are some additional factors to bear in mind, which are discussed in the following sections. In particular, it is important to remember that there are limits to the amount of data that can be copied between domains. Do not use Auto Image Replication to duplicate and replicate all of your data offsite unless you have done a thorough study and upgrade of your storage and network bandwidth requirements in order to support such a load. As with SLPs in general, it is essential that you ramp up slowly, starting with only a portion of your backups and gradually adding more.
The SLP that provides the source side of the Auto Image Replication can also be used to create other copies of the backup in the source domain. The source domain can send images to multiple target domains, as configured via the underlying storage. The source domain can also act as a target domain, with other SLPs that import images duplicated from some other domain(s). The SLP that provides the target side of the Auto Image Replication catalogs the backup in the target domain and may also duplicate the backup to longer-term storage in the target domain. This SLP includes an import storage operation where the backup arrives, but may also contain other storage operations to which the backup is then duplicated. At least one storage operation in the target domain's SLP must specify the Target Retention to ensure that the backup is retained for the period of time specified by the source SLP. However, this does not have to be the import destination; it is typical for the import storage operation to have a short retention period and for the SLP to be configured to duplicate the backup to another destination set to the Target Retention for long-term storage. (Note that in NetBackup 7.1 the import storage operation can only have a fixed retention and cannot be configured for expire on duplication.) The automatic import operation at the target domain is an optimized operation. The replication of the image to the target domain also replicates the metadata for the image. At the remote site, the metadata need only be moved into the catalog (rather than reconstructed from scratch, as is done by the two-phase import method).
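The following hypothetical illustration shows this typical arrangement; the SLP names, storage choices and retention periods are invented for the example:

    Source domain SLP "GoldBackup":
        1. Backup       -> deduplication pool, fixed retention (2 weeks)
        2. Replication  -> target domain storage

    Target domain SLP "GoldImport":
        1. Import       -> deduplication pool, fixed retention (1 week)
        2. Duplication  -> tape, Target Retention (for example, 7 years)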
This behavior can be prevented by setting the parameter AUTO_CREATE_IMPORT_SLP = 0 in the LIFECYCLE_PARAMETERS file on the target domain master server. If this parameter is set and a suitable import SLP does not exist, the target master server will show a failed Import job in its Activity Monitor. This failure will not be visible in the source domain, where a successful replication will be indicated. Once a suitable SLP exists in the target domain, the failed import will be processed by the next import job (which is triggered by the next replication from the source domain).
Storage considerations
Auto Image Replication requires compatible OpenStorage devices (including NetBackup deduplicating devices) to transfer the data between the source and target NetBackup domains because it leverages the unique capabilities of the OpenStorage API in order to make the replication process efficient. OpenStorage devices will need a plug-in that supports the Auto Image Replication feature in order to make use of it. Refer to the NetBackup Hardware Compatibility List for supported storage devices.
Bandwidth considerations
The network bandwidth required for Auto Image Replication is unpredictable: NetBackup has no way to predict in real time the deduplication rate of a replicated image set, the optimized duplication throughput of the storage device, or the current network traffic. Additionally, replication is very likely to occur over a WAN, which implies longer latencies and lower bandwidth in general. This is another reason why it is wise to plan accordingly and to ramp up slowly.
Restore considerations
The Auto Image Replication feature is primarily intended as a site disaster recovery mechanism. It allows mission-critical backups to be selectively copied to a target location where they may be restored to alternate client hardware in a separate NetBackup domain in the event of the loss of the original site. There is no automated method to restore backups directly from the target domain to clients in the source domain. However, restores to the original client can still be performed using all of the manual methods available in previous releases of NetBackup. For instance, one could send a tape containing a copy of the backup that was created at the target domain to the source domain and import it.
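A manual two-phase import of such a tape at the source domain might look like the following sketch (the media ID A00001 is hypothetical):

    bpimport -create_db_info -id A00001   # phase 1: read the media and build image database info
    bpimport                              # phase 2: import the images into the catalog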
Retention considerations
The minimum period of time that a backup is to be held in the target domain is determined by a retention period set by the source SLP in the source domain. This retention period, specified for target storage, may be longer than any retention period set for local copies of the backup in the source domain. (For example, a copy of the backup may need to be held for years at a remote repository for compliance reasons.) It is important to remember that the copy in the target domain is not tracked in the catalog of the source domain. Once all copies of the backup in the source domain have expired, users in that domain will need to do one of the following to determine that a copy exists at the target domain:

- Run the OpsCenter SLP Status Report (by image or by image copy).
- Run the nbstlutil repllist -U command on the source domain's master server to display information about backups that should have a copy in the target domain. (This command uses information retained in the source domain's catalog about successful replications and does not interrogate the target domain directly. As such, it can give a false positive if the image was manually expired from the target domain.)
A Snapshot operation is always required to create the initial snapshot. A Replication operation can then be used to control the replication of the snapshot to another volume. A Backup from Snapshot operation can also be used to create a tar-formatted backup from the snapshot on disk backup storage. This backup can then be duplicated to other backup storage such as tape using a duplication operation.
Note: The Backup job that results from the Backup from Snapshot operation is under the control of the SLP and the Duplication Manager. The Duplication Manager decides when to run the backup job, which may be outside of the backup window defined in the backup policy.

Refer to the NetBackup Replication Director Solutions Guide for more details on the setup and operation of Replication Director.

Note: Replication Director is not integrated with Auto Image Replication in NetBackup 7.5, and Replication Director SLPs can only use storage operations within a single NetBackup domain.
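As a hypothetical illustration, a Replication Director SLP covering the operations described above might chain them as follows (the storage choices are invented for the example):

    1. Snapshot              (create the initial snapshot on the primary volume)
    2. Replication           (replicate the snapshot to a secondary volume)
    3. Backup from Snapshot  (create a tar-formatted backup on disk backup storage)
    4. Duplication           (copy the backup to tape for long-term retention)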
MAX_GB_SIZE_PER_DUPLICATION_JOB = 25
This entry controls the maximum size of a duplication job. (When a single image is larger than the maximum size, that one image will be put into its own duplication job.) Consider your tolerance for long-running duplication jobs. Can you tolerate a job that will run for 8 hours, consuming tape drives the entire time? Four hours? One hour? Then calculate the amount of data that would be duplicated in that amount of time. Remember that the larger the duplication job is, the more efficient the duplication job will be.
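As a rough worked example, with a purely illustrative throughput figure: if a duplication stream moves about 100 MB/sec, a four-hour job corresponds to roughly 100 MB/sec x 14,400 sec = 1,440,000 MB, or about 1.4 TB, so MAX_GB_SIZE_PER_DUPLICATION_JOB = 1400 would be a reasonable starting point for a four-hour tolerance.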
Note: For very small environments and evaluation/proof-of-concept purposes, a greater degree of granularity can be achieved by using the parameters MIN_KB_SIZE_PER_DUPLICATION_JOB and MAX_KB_SIZE_PER_DUPLICATION_JOB instead of the parameters described above. The default values remain the same but the overriding values are specified in kilobytes rather than gigabytes.

MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 30
Often there are low-volume times of day or low-volume SLPs. If new backups are small or are not appearing as quickly as during typical high-volume times, adjusting this value can improve duplication drive utilization during low-volume times or can help you to achieve your SLAs. Reducing this value allows duplication jobs to be submitted that do not meet the minimum size criterion.

DUPLICATION_SESSION_INTERVAL_MINUTES = 5
This parameter indicates how frequently nbstserv checks whether enough backups have completed and decides whether or not it is time to submit duplication jobs.

IMAGE_EXTENDED_RETRY_PERIOD_IN_HOURS = 2
After duplication of an image fails 3 times, this is the time interval between subsequent retries.

DUPLICATION_GROUP_CRITERIA = 1
This parameter lets administrators tune an important part of the batching criteria. The entry applies to both tape and disk use and has two possible values:
0 = Use the SLP name. Batches are created based on the SLP name.
1 = Use the duplication job priority. Batches are created based on the duplication job priority from the SLP definition.

TAPE_RESOURCE_MULTIPLIER = 2
This value determines the number of duplication jobs that the Resource Broker will evaluate for granting access to a single destination storage unit. Storage unit configuration includes limiting the number of jobs that can access the resource at one time. The Maximum concurrent write drives value in the storage unit definition specifies the maximum number of jobs that the Resource Broker can assign for writing to that resource. Overloading the Resource Broker with jobs that it cannot run is not prudent, but enough work should be queued that the devices do not become idle. The TAPE_RESOURCE_MULTIPLIER parameter lets administrators tune the amount of work that is evaluated by the Resource Broker for a particular destination storage unit. For example, if a storage unit contains 3 write drives and the TAPE_RESOURCE_MULTIPLIER parameter is set to 2, the Resource Broker will consider 6 duplication jobs for write access to the destination storage unit.

MAX_IMAGES_PER_SNAPSHOT_REPLICATION_JOB = 50
Sets the maximum number of snapshot images that can be included in a snapshot replication job. This parameter can be used in a Replication Director configuration to control how many snapshot jobs are sent to the disk array, to avoid overloading the replication infrastructure of the OpenStorage partner. To be effective, MAX_IMAGES_PER_SNAPSHOT_REPLICATION_JOB must be used with the Limit I/O streams disk pool option, which limits the number of NetBackup jobs that can run concurrently to each volume in the disk pool.

The syntax of the LIFECYCLE_PARAMETERS file, using default values, is shown below. Only the non-default parameters need to be specified in the file; any parameters omitted from the file will use default values.

MIN_GB_SIZE_PER_DUPLICATION_JOB = 8
MAX_GB_SIZE_PER_DUPLICATION_JOB = 25
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB = 30
DUPLICATION_SESSION_INTERVAL_MINUTES = 5
IMAGE_EXTENDED_RETRY_PERIOD_IN_HOURS = 2
DUPLICATION_GROUP_CRITERIA = 1
TAPE_RESOURCE_MULTIPLIER = 2
MAX_IMAGES_PER_SNAPSHOT_REPLICATION_JOB = 50
Storage Lifecycle Policy reporting is possible in the following views of the SLP Status report set:

- SLP Status by SLP
- SLP Status by Destination
- SLP Duplication Progress
- SLP Status by Client
- SLP Status by Image
- SLP Status by Image Copy
- SLP Backlog
SLP reporting is included with the basic OpsCenter data gathered from the master servers. There are only two reports in the Point and Click Report area: SLP Status and SLP Backlog. Clicking on these links opens the GUI to provide a great deal more information before the filtering process starts.
Figure 2 - SLP Status Report

The information included in this at-a-glance report gives an overview of the health of the SLPs in each NetBackup domain (under each master server) listed. It shows Storage Lifecycle Policy completion statistics in percentages and in hard numbers so that an administrator can quickly get a feel for whether there are unexpected issues with the processing of the Storage Lifecycle Policies in any domain. These completion statistics are provided in three different metrics: the number of images, the number of copies to be made, and the size of the copies to be made. Also shown, in tabular format, is the number of images in the backlog, in the Images not Storage Lifecycle Policy Complete column. Other useful fields are the expected and completed sizes. Most of the fields have a hyperlink for additional drill-down. An example is the first hyperlink, the Master Server (where the SLP lives) link: clicking on the name of the master server opens the SLP Status by SLP report with additional drill-downs.
Figure 4 - Auto Image Replication Activity Reporting in the SLP Report

This report can be configured to be emailed to the administrator on a scheduled basis to help determine whether the Auto Image Replication process is working correctly.
Figure 5 - SLP Backlog Report

Data for the Backlog report is collected from the OpsCenter database at midnight every day, so the information is not real-time. The intent of the report is to show a trend in the backlog over time. The backlog is expected to grow and shrink during each day, but the general trend should be level. If the backlog is growing over the course of many days or weeks when backup volumes are not growing, the reason for the growth should be investigated. It may be that there is not enough infrastructure to handle the amount of duplication traffic the SLPs are generating. To obtain information about the current backlog at any moment in time, use the nbstlutil command as described in the earlier section, Monitoring SLP progress and backlog growth.
About Symantec: Symantec is a global leader in providing storage, security and systems management solutions to help consumers and organizations secure and manage their information-driven world. Our software and services protect against more risks at more points, more completely and efficiently, enabling confidence wherever information is used or stored.
For specific country offices and contact numbers, please visit our Web site: www.symantec.com
Symantec Corporation World Headquarters 350 Ellis Street Mountain View, CA 94043 USA +1 (650) 527 8000 +1 (800) 721 3934
Copyright 2011 Symantec Corporation. All rights reserved. Symantec and the Symantec logo are trademarks or registered trademarks of Symantec Corporation or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners.