TSM Disaster Recovery Strategies
Charlotte Brooks
Matthew Bedernjak
Igor Juran
John Merryman
ibm.com/redbooks
International Technical Support Organization
November 2002
SG24-6844-01
Note: Before using this information and the product it supports, read the information in
“Notices” on page xxi.
Contents

Figures
Tables
Examples
Notices
Trademarks
Preface
The team that wrote this redbook
Become a published author
Comments welcome
6.5.3 TSM server LAN, WAN, and SAN connections
6.5.4 Remote device connections
6.6 TSM server and database policy considerations
6.7 Recovery scheduling
Chapter 7. TSM tools and building blocks for Disaster Recovery
7.1 Introduction
7.2 The TSM server, database, and log
7.2.1 TSM database page shadowing
7.2.2 TSM database and recovery log mirroring
7.3 TSM storage pools
7.4 Requirements for TSM server Disaster Recovery
7.5 TSM backup methods and supported topologies
7.5.1 Client backup and restore operations
7.5.2 Traditional LAN and WAN backup topology
7.5.3 SAN (LAN-free) backup topology
7.5.4 Server-free backup
7.5.5 Split-mirror/point-in-time copy backup using SAN
7.5.6 NAS backup and restore
7.5.7 Image backup
7.6 TSM Disaster Recovery Manager (DRM)
7.6.1 DRM and TSM clients
7.7 TSM server-to-server communications
7.7.1 Server-to-server communication
7.7.2 Server-to-server virtual volumes
7.7.3 Using server-to-server virtual volumes for Disaster Recovery
7.7.4 Considerations for server-to-server virtual volumes
7.8 TSM and high availability clustering
7.8.1 TSM and High Availability Cluster Multi-Processing (HACMP)
7.8.2 TSM backup-archive and HSM client support with HACMP
7.8.3 TSM and Microsoft Cluster Server (MSCS)
7.8.4 TSM backup-archive client support with MSCS
7.9 TSM and remote disk replication
7.10 TSM and tape vaulting
7.10.1 Electronic tape vaulting
7.11 Remote disk mirroring and tape vaulting solutions
7.11.1 Collocation considerations for offsite vaulting
7.11.2 Reclamation considerations for offsite vaulting
11.2.1 Exclude files for mksysb backup
11.2.2 Saving additional volume group definitions
11.2.3 Classic mksysb to tape
11.2.4 Mksysb to CD-ROM or DVD
11.2.5 The use of TSM in addition to mksysb procedures
11.2.6 Bare metal restore using mksysb media
11.3 Using NIM for bare metal restore
11.3.1 Disaster Recovery using NIM and TSM
11.3.2 Basic NIM Setup
11.3.3 AIX bare metal restore using NIM
11.3.4 NIM administration
11.4 SysBack overview
11.4.1 Network Boot installations using SysBack
Abbreviations and acronyms
Index
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product, program, or service that
does not infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer
of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at
any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm
the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on
the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the
sample programs are written. These examples have not been thoroughly tested under all conditions. IBM,
therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy,
modify, and distribute these sample programs in any form without payment to IBM for the purposes of
developing, using, marketing, or distributing application programs conforming to IBM's application
programming interfaces.
The following terms are trademarks of International Business Machines Corporation and Lotus Development
Corporation in the United States, other countries, or both:
ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel Corporation in the United
States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the
United States, other countries, or both.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun
Microsystems, Inc. in the United States, other countries, or both.
C-bus is a trademark of Corollary, Inc. in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
SET, SET Secure Electronic Transaction, and the SET Logo are trademarks owned by SET Secure
Electronic Transaction LLC.
Other company, product, and service names may be trademarks or service marks of others.
This redbook is organized into two parts. Part 1 presents the general Disaster
Recovery Planning process. It shows the relationship (and close interconnection)
of Business Continuity Planning with Disaster Recovery Planning. It also
describes how you might set up a Disaster Recovery Plan test. Various general
techniques and strategies for protecting your enterprise are presented. Part 2
focuses on the practical, such as how to use IBM Tivoli Disaster Recovery
Manager to create an auditable and easily executed recovery plan for your Tivoli
Storage Manager server. It also shows approaches for bare metal recovery on
different client systems.
This book is written for any computing professional who is concerned about
protecting their data and enterprise from disaster. It assumes you have basic
knowledge of storage technologies and products, in particular, IBM Tivoli Storage
Manager.
Charlotte Brooks is a Project Leader for Tivoli Storage Management and Open
Tape Solutions at the International Technical Support Organization, San Jose
Center. She has 12 years of experience with IBM in the fields of RISC
System/6000 and Storage. She has written eight redbooks, and has developed
and taught IBM classes on all areas of storage management. Before joining the
ITSO in 2000, she was the Technical Support Manager for Tivoli Storage
Manager in the Asia Pacific Region.
Thanks to the following people for their contributions to this project:

Jon Tate
International Technical Support Organization, San Jose Center
Dan Thompson
IBM Tivoli, Dallas
Jeff Barckley
IBM Software Group, San Jose
Tony Rynan
IBM Global Services, Australia
Ernie Swanson
Cisco Systems
Nate White
UltraBac Software
Your efforts will help increase product acceptance and customer satisfaction. As
a bonus, you'll develop a network of contacts in IBM development labs, and
increase your productivity and marketability.
Find out more about the residency program, browse the residency index, and
apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
Part 1. Disaster Recovery Planning
In this part we overview the Disaster Recovery Planning process, including basic
definitions, the SHARE Disaster Recovery tiers, and an introduction to Business
Impact Analysis and Business Continuity Planning. We describe how to test and
maintain a Disaster Recovery Plan, and factors to consider when setting up a
data center to maximize availability. Finally, we concentrate on IBM Tivoli Storage
Manager, how to relate the planning process to specific product capabilities, and
some of the basic Tivoli Storage Manager tools for protecting the client and
server.
This book covers a wide range of topics, from high level business disaster
recovery planning, to specific TSM Disaster Recovery functions and to operating
system bare metal restore. Therefore, we believe this book can provide value to
those in a wide range of roles, including: customers developing DR strategies, IT
Managers, IT Specialists, TSM Administrators, Disaster Recovery Consultants,
Sales Specialists, and other related professionals.
Part 3 - Appendices
Disaster Recovery and Business Impact Analysis Templates, Windows BMR
Configuration Scripts, Sample DRM Plan.
This book focuses on those TSM concepts specific to Disaster Recovery and
assumes that the reader has experience with general TSM concepts. Therefore,
we do not intend to provide a basic introduction to TSM. However, we provide
references to other TSM Redbooks and manuals where appropriate.
(Figure: causes of unplanned downtime and data loss, by category; the charts break outages down across software, hardware, operations, application, network, database/backup, and other causes.)
Understanding risks and the associated cost of downtime for your business is a critical element of the planning process. Lost revenue is only a portion of the total cost of downtime, which includes the following categories:
Employee Costs:
- Employee and contractor idle time
- Salaries paid to staff unable to undertake billable work

Direct Fiscal Losses:
- Lost revenues
- Delays in enterprise accounting
- Loss of revenue for existing service contracts (customer SLA failure)
- Lost ability to respond to contract opportunities
- Loss of interest on overnight balances
- Cost of interest on lost cash flow

Long Term Losses:
- Penalties from failure to provide tax and annual reports
- Loan rate fluctuations based on market valuation
- Loss of control over debtors
- Loss of credit control and increased bad debt
- Delayed profitability for new products and services
- Brand image recovery
- Loss of share value
- Lost market share

Recovery Site Costs:
- Cost of replacement of buildings and plant
- Cost of replacing infrastructure and equipment
- Cost of replacing software
- Cost of DR contract activations
- Cost of third party and contractor support
These losses are quantified and documented during the Business Impact
Analysis (BIA) phase of the BCP process. Critical processes are identified and
analyzed at a business level to determine the actual cost of downtime for each
process. The enterprise cost of downtime varies from industry to industry, but in
general the costs can be staggering. Figure 1-4 shows the cost of IT downtime
across many US industries. These costs impose rigorous demands for data
availability on the enterprise.
(Figure 1-4: cost of IT downtime by industry, for industries including financial/banking, insurance, pharmaceuticals, credit card, banking, online auction, retail, brokerage operations, energy, information technology, and manufacturing.)
It doesn’t take an independent analyst to realize that the costs associated with
creating and assuring availability for the enterprise rise dramatically as you
approach the requirement for 100% availability. The real challenge is defining the
balance between the relative cost of downtime and the cost of maintaining
availability for critical business processes. The following chapters about Disaster
Recovery and TSM discuss planning methods to maximize availability for the
enterprise.
What steps do you follow to assess and remedy this situation? Does your staff
have a detailed plan to follow? Have you designed the IT infrastructure to recover
from this kind of outage in a timely fashion? Read the rest of this book to learn
about strategies to deal with these questions and help you sleep better at night.
Business Continuity
Business continuity describes the processes and procedures an organization
puts in place to ensure that essential functions can continue during and after a
disaster. Business Continuity Planning seeks to prevent interruption of
mission-critical services, and to re-establish full functioning as swiftly and
smoothly as possible.
Risk Analysis
A risk analysis identifies important functions and assets that are critical to a firm’s
operations, then subsequently establishes the probability of a disruption to those
functions and assets. Once the risk is established, objectives and strategies to
eliminate avoidable risks and minimize impacts of unavoidable risks can be set. A
list of critical business functions and assets should first be compiled and
prioritized. Following this, determine the probability of specific threats to business
functions and assets. For example, a certain type of failure may occur once in 10
years. From a risk analysis, a set of objectives and strategies to prevent, mitigate, and recover from disruptive threats should be developed.
DR hotsite
A DR hotsite is a data center facility with sufficient hardware, communications
interfaces and environmentally controlled space capable of providing relatively
immediate backup data processing support.
DR warmsite
A DR warmsite is a data center or office facility which is partially equipped with
hardware, communications interfaces, electricity and environmental conditioning
capable of providing backup operating support.
DR coldsite
A DR coldsite is one or more data center or office space facilities equipped with
sufficient pre-qualified environmental conditioning, electrical connectivity,
communications access, configurable space and access to accommodate the
installation and operation of equipment by critical staff required to resume
business operations.
High Availability
High availability describes a system’s ability to continue processing and
functioning for a certain period of time — normally a very high percentage of
time, for example 99.999%. High availability can be implemented in your IT
infrastructure by reducing any single points-of-failure (SPOF), using redundant
components. Similarly, clustering and coupling applications between two or more
systems can provide a highly available computing environment.
(Figure: IBM Tivoli Storage Manager platform, application, and storage support. Client and server platforms shown include Windows 95/98/NT/2000/XP, Macintosh, Novell NetWare, OS/2, Linux, AIX, HP-UX, Solaris, IRIX, Digital/Tru64 UNIX, OpenVMS, NCR UNIX SVR4, Tandem Guardian, Fujitsu, DG/UX, SCO UNIX, Sinix, Pyramid Nile, NEC EWS-UX/V, SCO Open Desktop, NUMA-Q, Sequent PTX, DEC Alpha, AS/400, MVS, z/OS, OS/400, and VM, connected over LAN/WAN and SAN. Application and database support is provided through Tivoli Data Protection for Applications (Lotus Domino/Notes, Microsoft SQL Server and Exchange Server, Informix, Oracle via RMAN, SAP R/3, WebSphere Application Server) and integrated functionality in IBM DB2 UDB. Intelligent disk subsystem support includes EMC Symmetrix TimeFinder and IBM Shark FlashCopy; the TSM storage hierarchy spans disk, optical, and tape.)
Tivoli Storage Manager provides data protection, disaster recovery, and storage
management functionality for the enterprise. TSM storage management services
include:
Operational Backup and Restore of Data: The backup process creates a
copy of the data to protect against the operational loss or destruction of file or
application data. The customer defines how often to back up (frequency) and
how many copies (versions) to hold. The restore process places the backup
copy of the data back onto the designated system or workstation.
Disaster Recovery: By the creation of multiple copies of enterprise data,
TSM supports the implementation of site to site recovery operations. Disaster
Recovery with TSM includes moving data to offsite locations, rebuilding or
initializing TSM infrastructure, and reloading data to clients in an acceptable
time frame. Many such scenarios are discussed in later chapters.
Vital Record Retention, Archive, and Retrieval: The archive process
creates a copy of a file or a set of files for long term storage. Files can remain
on the local storage media or can be deleted. The customer controls how long
(retention period) an archive copy is to be retained. The retrieval process
locates the copies within the archival storage and places them back into a
customer-designated system.
Figure 1-6 Data movement with TSM and the TSM storage hierarchy: client data flows over WAN, LAN, or SAN to the TSM server, is written to a disk storage pool, migrated to a tape storage pool, and copied to a copy storage pool.
TSM client data is moved via SAN or LAN connections to the TSM server, written
directly to disk or tape primary storage pool (and optionally simultaneously to a
copy storage pool in TSM 5.1), migrated to other primary storage pools, and copied as many times as necessary to additional copy storage pools.
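To make this flow concrete, here is a hedged sketch of TSM 5.1 administrative commands that could define such a hierarchy; the device class, pool, and volume names are illustrative assumptions, not taken from this book:

   define devclass ltoclass devtype=lto library=lib1 format=drive
   define stgpool tapepool ltoclass maxscratch=50
   define stgpool diskpool disk nextstgpool=tapepool highmig=70 lowmig=30
   define volume diskpool /tsm/stg/disk01.dsm formatsize=2048
   define stgpool copypool ltoclass pooltype=copy maxscratch=50
   update stgpool diskpool copystgpools=copypool
   backup stgpool diskpool copypool

The UPDATE STGPOOL ... COPYSTGPOOLS entry enables the TSM 5.1 simultaneous write to the copy pool, while the explicit BACKUP STGPOOL covers data already resident in the primary pool.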
In addition to data backup, archive copies of data can also be created using
TSM. Archive creates an additional copy of data and stores it for a specific
amount of time — known as the retention period. TSM archives are not expired
until the retention period is past, even if the original files are deleted from the
client system.
Therefore, the difference between backup and archive is that backup creates
and controls multiple backup versions that are directly attached to the original
file; whereas archive creates an additional file that is retained for a specific
period of time.
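For example, from the backup-archive client command line the two operations look quite different (the paths and description string here are illustrative):

   dsmc incremental /home/projects
   dsmc archive "/home/projects/*" -subdir=yes -description="FY2002 close"
   dsmc retrieve "/home/projects/*" -description="FY2002 close" /tmp/restored/

The incremental backup creates versions bound to the original files and governed by the backup copy group; the archive creates an independent copy retained for the archive retention period and retrievable by its description.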
Oracle, Informix, Lotus Notes, Lotus Domino, Microsoft Exchange, Microsoft SQL Server, SAP R/3, and WebSphere Application Server each have their own storage management interface, which the corresponding TDP application integrates with the TSM data management API. DB2 from IBM
integrates the TSM API directly, without requiring a separately purchased TDP
product. Some of the TSM data protection applications leverage IBM and EMC
intelligent disk subsystem advanced copy functions such as FlashCopy and
TimeFinder. This functionality bridges TSM and high-availability storage
infrastructures to maximize application availability.
Before a communication session between the TSM Client and the TSM Server
begins, an authentication handshaking process occurs with authentication tickets
and a mutual suspicion algorithm. The TSM security protocol is modeled after
the Kerberos network authentication protocol, which is a highly respected
method for secure signon cryptography. The client uses its password as part of
an encryption key, and does not send the password over the network. Each
session key is unique, so replaying a session stream will not result in a signon to
the TSM server. This significantly lowers the chance of a TSM session being
hijacked by an outside user.
To heighten security for TSM sessions, data sent to the TSM server during
backup and archive operations can be encrypted with standard DES 56-bit
encryption. For WAN implementations of TSM across public networks, data encryption complements and completes data security for TSM.
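As a hedged sketch, a TSM 5.1 UNIX client could enable this with client option entries like the following (the /payroll path is an assumed example):

   * dsm.sys fragment: encrypt payroll data sent to the server
   passwordaccess  generate
   encryptkey      save
   include.encrypt /payroll/.../*

The INCLUDE.ENCRYPT statement selects which files are encrypted, and ENCRYPTKEY SAVE stores the encryption key password locally so scheduled backups do not prompt for it.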
A list of TSM tools and strategies for protection against disasters and for
recovering in the event of disasters is given here:
Database and recovery log mirroring (see the command sketch following this list)
Database page shadowing
Storage pool manipulation for disaster protection
Varied client backup operations
Varied backup methods and topologies
TSM Disaster Recovery Manager (DRM)
TSM server-to-server communications
TSM server-to-server virtual volumes
TSM and high availability clustering
TSM and remote disk replication
TSM traditional and electronic tape vaulting
TSM and system bare metal restore integration
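For the first two items in this list, a minimal command sketch on a TSM 5.x server might look like this (the volume paths and device class name are assumptions):

   * dsmserv.opt server option enabling database page shadowing: DBPAGESHADOW YES
   define dbcopy /tsm/db/db01.dsm /tsmmirror/db/db01.dsm
   define logcopy /tsm/log/log01.dsm /tsmmirror/log/log01.dsm
   set logmode rollforward
   backup db devclass=ltoclass type=full

DEFINE DBCOPY and DEFINE LOGCOPY mirror a database and recovery log volume onto a second disk, while rollforward log mode plus regular full database backups allow recovery to the most recent committed transaction.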
DRM overview
Tivoli Storage Manager delivers Tivoli Disaster Recovery Manager (DRM) as part
of its Extended Edition. DRM offers various options to configure, control and
automatically generate a Disaster Recovery Plan containing the information,
scripts, and procedures needed to automate restoration of the TSM Server and
helps ensure quick recovery of client data after a disaster. It also manages and
tracks the media on which TSM data is stored, whether on site, in-transit, or in a
vault, so that data can be easily located if disaster strikes. It generates scripts
which assist in documenting IT systems and recovery procedures, as well as
providing automated steps to rebuild the TSM server.
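A hedged sketch of a daily DRM cycle on the server could look like this (the plan file prefix and pool names are assumptions):

   set drmcopystgpool copypool
   set drmplanprefix /tsm/plans/drmplan
   backup stgpool diskpool copypool
   backup db devclass=ltoclass type=full
   prepare
   move drmedia * wherestate=mountable tostate=vault remove=yes

PREPARE generates the recovery plan file at the configured prefix, and MOVE DRMEDIA transitions the newly written copy pool and database backup volumes toward the vault while DRM tracks their state.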
TSM APIs
The TSM APIs are used for Tivoli’s own TDP products (see 1.4.6, “Tivoli Data
Protection for Applications modules” on page 17), but they are also published
and documented. This allows ISVs to adapt their solutions to integrate with TSM
to extend its functionality. In particular, various vendors have used the APIs to
provide bare metal recovery solutions for various platforms. Vendors exploiting these APIs for Disaster Recovery include Cristie, UltraBac Software, and VERITAS Bare Metal Restore. More information on these companies and their solutions is given later in this redbook.
Once the tiers have been described we outline specific TSM functions and
strategies that can be used to achieve the various tiers.
In 1992, the SHARE user group in the United States, in combination with IBM,
defined a set of DR tier levels. This was done to address the need to properly describe and quantify the different methodologies for successful mission-critical computer systems DR implementations. Accordingly, within the IT
Business Continuance industry, the tier concept continues to be used, and is
very useful for describing today's DR capabilities. The tiers’ definitions are
designed so that emerging DR technologies can also be applied. These tiers are
summarized in Figure 2-1.
The following sections provide an overview of each of the tiers, describing their
characteristics and associated costs. Typical recovery times (based on industry
experience and the capabilities of the recovery strategy) are also noted. The
purpose is to introduce these tiers for those not familiar with them, and then later
directly link these recovery tiers with TSM DR strategies.
Figure 2-3 Tier 1 - Offsite vaulting (PTAM): daily backups move from Datacenter A to an offsite vault.
Because vaulting and retrieval of data is typically handled by couriers, this tier is
described as the Pickup Truck Access Method (PTAM). PTAM is a method used
by many sites, as this is a relatively inexpensive option. It can, however, be
difficult to manage; that is, it is difficult to know exactly where the data is at any point in time. Typically, only selected data is saved. Certain requirements have been determined and documented in a contingency plan, and there is optional backup hardware and a backup facility available.
While some customers reside on this tier and are seemingly capable of
recovering in the event of a disaster, one factor that is sometimes overlooked is
the recovery time objective (RTO). For example, while it may be possible to
eventually recover data, it may take several days or weeks. An outage of
business data for this period of time can have an impact on business operations
that lasts several months or even years (if not permanently).
Note: The typical length of time for recovery is normally more than a week.
Figure 2-4 Tier 2 - Offsite vaulting with a hotsite (PTAM + hotsite): daily backups move from Datacenter A to an offsite vault, from which media are delivered to Datacenter B (the hotsite) at recovery time.
Note: The typical length of time for recovery is normally more than a day.
Figure 2-5 Tier 3 - Offsite electronic vaulting: daily backups go to an offsite vault, which supplies the data at recovery time.
The hotsite is kept running permanently, thereby increasing the cost. As the
critical data is already being stored at the hotsite, the recovery time is once again
significantly reduced. Often, the hotsite is a second data center operated by the
same firm or a Storage Service Provider.
Note: The typical length of time for recovery is normally about one day.
Figure 2-6 Tier 4 - Electronic vaulting with hotsite (active secondary site): daily backup to an offsite vault, with the data applied at recovery time.
In this scenario, the workload may be shared between the two sites. There is a
continuous transmission of data between the two sites with copies of critical data
available at both sites. Any other non-critical data still needs to be recovered from
the offsite vault via courier in the event of a disaster.
Note: The typical length of time for recovery is usually up to one day.
Figure 2-7 Tier 5 - Two-site, two-phase commit: high bandwidth connections between the two sites.
Note: The typical length of time for recovery is usually less than 12 hours.
Figure 2-8 Tier 6 - Zero data loss (advanced coupled systems): data sharing between the sites.
Note: The typical length of time for recovery is normally a few minutes.
Figure 2-9 illustrates the relationship between the tiers of disaster recovery
solutions, recovery time, and cost.
Figure 2-9 Seven tiers of Disaster Recovery solutions: cost plotted against time to recover, rising from Tier 1 - offsite vaulting (PTAM, point-in-time backup), through Tier 2 - offsite vaulting with hotsite (PTAM + hotsite), Tier 3 - electronic vaulting, and Tier 4 - electronic vaulting and hotsite (requires active secondary site), to Tier 5 - two-site two-phase commit (requires dedicated remote hotsite).
Many of the tiers described define the ability to recover your data. The distinctions between the tiers are how quickly you need to recover your data (RTO), how quickly you need to recover the services provided by your environment, and how much data you can afford to lose (RPO). Therefore, a recovery solution should be chosen by weighing your business's unique recovery criteria against how much lost revenue the company incurs while it is down and unable to continue normal business processing. The shorter the time period required to recover the data to continue business processing, the higher the cost. Almost always, the longer a company is unable to process transactions, the more expensive the outage is going to be for the company.
Table 2-1 provides a summary of strategies and techniques that can be used with
TSM to achieve the various tiers of disaster recovery.
Tier 0 - No offsite data: With a site disaster there is no ability to recover, except by rebuilding the environment.

Tier 1 - Offsite Vaulting, also known as Pickup Truck Access Method (PTAM): Storage pool vaulting with the TSM server environment (fully integrated). Requires a DRP and careful management of offsite volumes; consider use of Disaster Recovery Manager (DRM), which can automate the TSM server recovery process and manage offsite volumes. The strategy includes vaulting of the TSM database, recovery log, volume history information, device configuration information, DRP file (if using DRM), and copy pools for storage at an offsite location.

Tier 2 - Offsite Vaulting with hotsite (PTAM + hotsite): The strategy can include vaulting of the TSM database, recovery log, volume history information, device configuration information, DRP file (if using DRM), and copy pools for storage at an offsite location.

Tier 3 - Electronic Vaulting: Consider use of TSM virtual volumes over a TCP/IP connection to allow storage of TSM entities (TSM database backups, recovery log backups, primary and copy storage pools, DRM plan files) on remote target servers, which store them as archive files on the target server.

Tier 4 - Electronic Vaulting to hotsite (active secondary site): TSM servers installed at both locations, optionally set up as peer-to-peer servers (that is, each server able to recover at the alternate site). Requires a DRP and careful management of offsite volumes; consider use of DRM. The strategy may include use of TSM virtual volumes over a TCP/IP connection, as for Tier 3. High bandwidth connections and data replication technology (for example, IBM PPRC, EMC SRDF) will support asynchronous data replication of TSM database backups and recovery log backups; TSM storage pools with critical data can be replicated as well. The solution may also include remote electronic tape vaulting of TSM database and recovery log backups and of primary or copy storage pools. Extended distances can be achieved by using distance technologies, for example, extended SAN, DWDM, or IP/WAN channel extenders.

Tier 5 - Two-site, Two-phase Commit: TSM servers installed at both locations, optionally set up as peer-to-peer servers. High bandwidth connections and data replication technology (for example, IBM PPRC, EMC SRDF) will support synchronous data replication of TSM database backups and recovery log backups or mirrors; TSM storage pools with critical data can be replicated as well. The solution may also include remote electronic tape vaulting of TSM database and recovery log backups and of primary or copy storage pools. Extended distances can be achieved by using distance technologies, for example, extended SAN, DWDM, or IP/WAN channel extenders.

Tier 6 - Zero Data Loss: TSM servers installed at both locations, optionally set up as peer-to-peer servers. Requires dual active data centers and high availability applications (for example, HACMP, HAGEO, MSCS) to support hot failover of servers from one data center to the other, using clustering/high availability solutions to fail over the TSM server environment. High bandwidth connections and data replication technology (for example, IBM PPRC, EMC SRDF) will support synchronous data replication of TSM database backups and recovery log backups or mirrors; TSM storage pools with critical data can be replicated as well. The solution may also include remote electronic tape vaulting of TSM database and recovery log backups and of primary or copy storage pools. Extended distances can be achieved by using distance technologies, for example, extended SAN, DWDM, or IP/WAN channel extenders.
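For the electronic vaulting tiers, a server-to-server virtual volume configuration might be sketched as follows (the server names, password, address, and device class values are illustrative, and parameter details vary by TSM level):

   (on the target server at the recovery site)
   register node primsrv vvsecret type=server

   (on the source server at the primary site)
   define server drsrv password=vvsecret hladdress=drsite.example.com lladdress=1500 nodename=primsrv
   define devclass remoteclass devtype=server servername=drsrv maxcapacity=500m mountlimit=2
   backup db devclass=remoteclass type=full

The database backup is then stored as archive data under the primsrv node on the remote server, satisfying the offsite requirement electronically.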
The financial industry traditionally leads in terms of stringent regulations for data
protection, security, and contingency planning. While some countries or regions
still require hardcopy contingency copies of financial data, others are quickly
migrating towards a completely electronic format for data. Increasing reliance on
electronic data, forms, and processes underscores the importance of integration
of enterprise storage management into the Disaster Recovery Planning process.
Recent legislative trends also are driving government organizations and health
care providers to meet similar requirements for business continuity and disaster
preparedness. In the US, the Health Insurance Portability and Accountability Act
(HIPAA) requires the entire health care industry to securely manage and protect
patient data through contingency planning and security measures.
Legal systems and regulatory groups also have varying definitions for
court-admissible electronic records. In many cases, WORM optical media is the
only acceptable format for non-tampered data in court proceedings. Moving data to optical media is not complex; however, developing enterprise policies and systems which support legal and technical requirements such as these is increasingly challenging.
Often, entire data center operations have grown through ad hoc planning
processes, with little or no metrics for storage management and enterprise
disaster recovery. These trends, along with corporate mergers, acquisitions,
consolidations, and globally distributed IT operations have created a myriad of
scenarios where no single solution can simply solve the storage management
challenge.
Application services now pull data from dozens of data sources and the web of
consequent dependencies for data continues to increase in complexity. The
proliferation of complex Web based applications, messaging, and data
management applications is changing the paradigm for what backup/restore has
traditionally meant to a business. Now, strategic planning and systems
management design is necessary to meet business requirements for availability
and recovery.
The ubiquity of relational databases, e-mail systems, and rich media dependent
systems (scanned documents, high-quality images, audio, video) all contribute to
the growth of storage in data processing and customer environments. Emerging
technologies (image recognition, advanced image analysis, wireless
applications, smart card, and so on) will only increase the demand for open,
scalable, manageable, high performing, and relatively sophisticated storage
systems.
One of the primary limitations of this approach is the fact that only one logical
copy of data exists on the primary, secondary, or tertiary mirror at any given time.
If the data is corrupted for any of a host of reasons, which is likely in an actual
disaster (unstable network connection, abrupt application shutdown,
hardware/operating system failure), the mirrored copies could also be rendered
useless for recovering operations. In subsequent chapters, we will discuss the
importance of policy driven systems and procedures, which in some cases do
incorporate these technologies for systems availability.
Today, Storage Area Network (SAN) technologies are being quickly adopted to
enhance performance, scalability, and flexibility of shared storage resources. The
SAN is a dedicated infrastructure for storage I/O, based on widely adopted
industry standards for hardware components, Fibre Channel protocol (FCP), and
ubiquitous SCSI standards. SANs allow hosts to access storage resources over
a network as if they were locally attached. Current Fibre Channel devices
support data transfer rates of 2 Gbps and in the future will support up to 10 Gbps
rates per connection. Like network connections, multiple Fibre Channel
connections can be established between hosts and devices, allowing highly
scalable and reliable methods for data transfer. To date, approximately 50% of
enterprise scale organizations have deployed a SAN in a production capacity.
Fibre optic distance technologies open a new paradigm of data movement over
long distances. Dense wavelength division multiplexing (DWDM) is a technology
that allows multiple streams and protocols of data to be combined on one or
several long distance fibre optic connections. This means that IP and FCP traffic
can be consolidated and routed at high speeds across long distances. Up to 200
billion bits per second (200 Gbps) can be delivered over a single optical fibre. In
theoretical terms, this means 100 TB of data could be moved to a remote site in
approximately 68 minutes over a single fibre/DWDM connection.
The numbers are calculated as follows. Assume a 1 Gbps Ethernet link. Dividing
by 8 to show the number of GBps gives .125. Therefore, 10 GB would be
transferred in 80 seconds, or 1.33 minutes. Note that no network actually achieves its theoretical performance; real rates of between 60% and 80% of the theoretical maximum are typical in production networks.
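The same arithmetic generalizes to a one-line calculation; as a quick sanity check (the 70% efficiency figure is an assumed mid-range value):

   # transfer time (seconds) = data_GB * 8 / (link_Gbps * efficiency)
   echo "scale=1; 10 * 8 / (1 * 0.70)" | bc    # 10 GB over 1 Gbps at ~70% gives about 114 seconds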
(Table: network technology, protocol, bandwidth, and time required to transfer data volumes, in minutes.)
In later chapters we will explore the integration of these technologies into TSM
architectural planning scenarios.
TSM planning and policy development, if properly done, will make this valuable
data accessible at the host and enterprise level. Understanding the enterprise
landscape, and creating policies to achieve business continuity, disaster
recovery, and performance goals can all be achieved through a methodical
planning process, which we discuss in greater detail in Chapter 7, “TSM tools
and building blocks for Disaster Recovery” on page 115.
The scope of a Business Continuity Plan will most certainly involve the IT
infrastructure. Most Business Continuity professionals view Disaster Recovery
Planning as an IT-centric logical subset of the Business Continuity Planning
process.
The focus of this redbook is to emphasize critical elements of BCP and DRP as
they relate to the use of TSM in an enterprise environment. For more
comprehensive background information on Business Continuity Planning, please
refer to IBM TotalStorage Solutions for Disaster Recovery, SG24-6457.
Figure 3-4 The relationship between BIA, RTO, and TSM planning: BIA/RPO/RTO analysis feeds process, procedure, and policy definition; from there, data protection policy creation and recovery plans lead into plan testing and training and change management.
The Business Process Analysis identifies critical processes for the Business
Impact Analysis (BIA). From the BIA, an application environment will usually be
assigned a cost per hour value, which directly impacts continuity plans and the
Recovery Time Objective (RTO). Essentially, the value of the data and the
access to the data directly correlates to policies and infrastructure decisions for
backup and recovery.
Once critical and supporting systems are identified, general policies for storage
management can be applied. The rationale behind this approach stems from the
principle that an application environment is only as effective as its weakest
component. If a supporting system which sends critical data to a critical system
fails, the overall critical business process may be compromised. These situations
are easily overlooked in enterprise planning efforts due to the increasing
complexity of data processing environments. An additional analysis of data within
each system can then be used to classify data based on restore priority.
In Figure 3-6, the RTO and the Network Recovery Objective (NRO) are usually the same, because customer access to the data is just as important as the system restoration. The scope of
network recovery procedures can center on critical systems environments and
spread to the enterprise, depending on network architectures in place. The RPO
depends strictly on the cost of recreating application data and can vary from
system to system. In general, this kind of planning exercise aids in the
organization of system priorities, storage management policies, and DR plans.
Since the audience includes a wide array of talent, the plan language needs to
be concise and accessible to technical and non-technical readers. A well written
plan provides a roadmap to IT recovery for current or replacement IT staff. After all, staff other than the regular administrators could be partly or fully responsible for IT recovery in a disaster scenario.
Since every IT environment is unique, a DRP must be built from a thorough and
site specific planning process. A balance must be struck between technical detail
and plan flexibility, to ensure a concise, functional, and scalable plan for the
enterprise. A DRP outlines team roles, responsibilities, and specific procedures
for restoring an environment during an unplanned outage. A DRP can be used on
several scales, ranging from system specific outages, to partial site outages, to
massive site failures. A well designed DRP supports business continuity for a
variety of outage situations.
(Figure: the Disaster Recovery Plan development process and document structure. The planning process runs from business process analysis, risk analysis/BIA, and BIA/RPO/RTO analysis through policy creation, policy/process/procedure definition, plan development, testing and training, and plan maintenance. The resulting DRP document contains background information (introduction, concept of operations); a notification/activation phase (notification procedures, damage assessment, plan activation); a recovery phase (sequence of recovery activities, recovery procedures); a reconstitution phase (migrate recent data to the production site, test systems and infrastructure, commission the production site); and appendices (contact lists, system requirements, vital records).)
The DRP organizes the data from the BCP and DR Planning processes into an
action plan for IT recovery. The five core sections of the Disaster Recovery Plan
are described in detail in the following sections. We have also included a sample
IT Disaster Recovery Plan template, which is located in “Disaster Recovery Plan
Template” on page 332.
Introduction
An introduction has a purpose, scope, and commitment statement:
Purpose: An explanation of the DRP development, operational needs for
availability and recovery, and the overall plan objective.
Scope: Identifies what the plan covers and does not cover. The plan design and intent is related to specific situations and scenarios as addressed in this section.
Concept of operations
The concept of operations provides a clear description of what infrastructure
exists, how the DR/operations teams are organized, and how the recovery plan
functions through various disaster scenarios. The concept of operations
subsections can include:
Environment overview: Provides a high-level overview of IT operations with
written and graphic explanations of where systems reside, where production
and recovery facilities reside, and where disaster recovery staff work in
normal and recovery operations.
System descriptions: Provides an architectural view and description of
critical infrastructure (including servers, storage devices, backup/recovery
systems, networks, Storage Area Networks, firewalls, and
telecommunications/ISP connections) and a general written description of
“how things work” in IT operations.
Recovery scenarios: Describes how the plan functions in a variety of
unplanned outage events, ranging from entire site loss to single systems
failure. This section is an extremely critical component for establishing
expectations for how the recovery plan will function in a variety of scenarios.
There is no way to plan perfectly for a disaster; however, understanding the plan design and testing the plan in a variety of scenarios is the best way to achieve full disaster preparedness.
Responsibilities: Outlines team member roles and responsibilities. Usually
an organization chart shows the team hierarchy to establish rules for decision
making, succession, escalation and replacement in the event of staff loss.
Greater detail is then assigned to specific roles (typically organized by
position name instead of personal name) for Disaster Recovery Plan
activation.
Damage assessment
The damage assessment is typically performed by the damage assessment
team and follows a critical action plan to determine the nature and extent of
site damage. Assuming the assessment team is not at risk of injury, the following
elements are generally included in the assessment scope:
Cause of disruption
Potential risks for additional disruptions or damage
Scope of damage and disruption
Physical infrastructure assessment (including structural integrity, power,
cooling, heating, ventilation, fire-suppression, telecommunications, and
HVAC)
Functional status of equipment (fully functional, partially functional,
nonfunctional)
Type of damage to IT equipment and media (including water damage, fire and heat, physical impact, electrical surge, electromagnetic pulse, and so on)
The damage assessment directly determines the extent to which the DRP is implemented. Scenarios range from system specific outages (such as hardware
failure, virus, or hacking) to site wide disaster situations, which would require the
full implementation of the Disaster Recovery Plan. Depending on the scope of
damage and the consequent recovery operations, the appropriate teams and
resources are notified according to the documented procedures.
Plan activation
The plan activation depends on the plan activation criteria, which is a set of
organization-specific metrics for decision making in an unplanned outage event.
Two critical elements comprise the recovery phase: the sequence of recovery activities and the recovery procedures.
If systems are being recovered at an alternate site, some or all of the following
components must be either already available at, or delivered to the recovery site:
backup tapes, hardware, software, software licenses, recovery plans, staff, and
even food/water supplies. Such activities are site and plan specific, but careful
planning and preparation simplifies the movement of resources from one site to another.
Recovery procedures
Recovery procedures provide detailed procedures to restore systems and
supporting infrastructure components. Assuming general facilities and infrastructure are in place or restored, recovery procedures generally target
specific recovery team members and address the following broad guidelines:
Installing hardware components
Recovery or re-install of operating system images/backups
Configuring network resources
Restoring/configuring system data
Restoring application software
Restoring application data
Testing system functionality and security controls
Connecting system to production network
Testing
Administering and monitoring replacement systems
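Where the failed system is the TSM server itself, the reinstallation and data restoration steps above reduce, in outline, to a sketch like the following (the volume names, paths, and device class are assumptions; a DRM plan file generates equivalent scripts automatically):

   # after reinstalling the TSM server code, recreate database and log volumes
   dsmserv format 1 /tsm/log/log01.dsm 1 /tsm/db/db01.dsm
   # restore the database from the vaulted database backup volume
   dsmserv restore db devclass=ltoclass volumenames=DBB001 commit=yes
   # once the server is started, mark primary volumes destroyed so client
   # restores are satisfied from copy storage pool volumes:
   #   update volume * access=destroyed wherestgpool=tapepool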
Once all systems are restored, tested, and functional, the Disaster Recovery
team transitions back to normal operational procedures. These continue while further decisions are made about whether and when a roll-back to the original or replacement site is necessary.
On one hand, insourcing BCP/DRP can bind the most detailed knowledge of
business processes and supporting infrastructure to an in-house plan, because
the full time employees who are ultimately responsible for the plan generation,
maintenance, and testing will have years of experience within the environment.
Since much of the planning process is based on the discovery of existing processes and resources, this strategy offers a distinct advantage over outsourcing.
Outsourcing, on the other hand, offers customers the ability to off load several
tasks to specialized service providers. Offsite tape vaulting, facilities, workspace,
telephone, and hardware contracts comprise the majority of disaster recovery
services available today. Risk assessments, threat assessments, and business
process analysis studies can also be outsourced components of a Business
Continuity Plan.
If contracting for the DR site with a commercial vendor, adequate testing time,
work space, security requirements, hardware requirements, telecommunications
requirements, support services, and recovery days (how long the organization
can occupy the space during the recovery period) must be negotiated and clearly
stated in the contract.
The overall business structure of an enterprise will remain relatively stable over a
period of time. A Disaster Recovery Plan is a vital element for an enterprise to
describe how the continuity of the business processes will be preserved in case
of a disaster. The technical details and the human resources of a business, however, typically change more frequently. An update process for the Disaster Recovery Plan is necessary, so that its functionality and effectiveness are preserved.
Ensure that all technical staff dealing with IT production take part in regular
testing. You must consider the possibility that some or all of the key technical
personnel may not be available during an actual disaster (for example, because
of vacation, injury, or other personal circumstance). Ideally, the testing should be
done by staff not specifically experienced in the platforms involved. This will
expose any “missing steps” in procedures which an expert would automatically
know how to do. We recommend that you have backups identified and assigned
for all the critical members of the DR team.
The details of the testing plan will depend on the tier level of your DR solution. If
you do not have a hotsite and only a Tier 1 level of recovery, the testing plan is very simple: you have base disaster recovery capability, you keep your vital data offsite, and you will establish the recovery on appropriate hardware. On the other
hand if you have invested in a dedicated remote hotsite (Tier 6 level) the testing
and planning will be much more complex.
Remember, your success in being able to recover in a real disaster depends on the
quality of your DR Plan, and your capability to execute it. Testing is really the key
to validating and enhancing your Disaster Recovery Plan, as well as giving your
Disaster Recovery team the experience and confidence to execute it.
Table 4-1 Differences between DR testing and a real disaster for Tiers 1-3

DR testing: People are under less stress than in a real disaster.
Real disaster: People under stress can make more mistakes.
Table 4-2 Differences between DR testing and a real disaster for Tiers 4-6

DR testing: People are under less stress than in a real disaster.
Real disaster: People under stress can make more mistakes.

DR testing: It is recommended to stay for some time on the backup site, but the roll-back time is set.
Real disaster: If disaster strikes, there may be no definite information on when the roll-back time will be, and operations on the backup site can in some cases run for months.

DR testing: The roll-back process is similar to the DR test, but in reverse (backup site restoring to the primary). This gives the opportunity to test the DRP twice with different staffing.
Real disaster: If the primary site has been destroyed by the disaster, a new roll-back plan should be prepared, because the new primary site may be different from the old one.
Some core procedures contained in the DRP are probably performed regularly
(in non-disaster situations) by expert IT administrators. Examples include
restoring an AIX system from a mksysb tape, or re-loading Windows 2000 from
the operating system CD. However, all of these procedures must still be properly documented in the DRP. Even an expert, when under stress, can make
mistakes, and even the experts may not be available at the crucial time.
The DRP should provide for two recovery cases. In the first case, the original
(primary) site is completely destroyed. You will have to continue operations at the
backup site for some time while the replacement equipment is ordered, installed
and commissioned. In this case, the eventual switch to the new (replacement)
site will usually occur over a period of time, and could be phased. However, some
disasters may only be temporary — for example, if physical, electrical, or
communications access is denied to your primary site for a period of time. In this
case, once services were restored, you would want to roll-back (return) to the
primary site as soon as possible. This procedure would be equivalent to the
original DR in terms of timing, but in this case returning to the primary from the
secondary. When you perform the test of the overall DRP you should consider
how the return to normal operation will be achieved. The development of complex
Disaster Recovery testing plans at the higher tier levels can be difficult and can have a significant effect on daily operations. We suggest that from Tier 4 upwards, you consult with a specialized Disaster Recovery service provider.
Customers have had good DR test experiences whereby they switch to the
backup site one weekend, remain there for one week, then roll-back to the
original site on the following weekend. In this way, the customer stays on the
backup site for five working days. Disaster Recovery Plan testing should be done
annually, because IT equipment and applications change very rapidly. Also, IT staff turnover mandates regular education, re-education, and practical training.
This example shows how a simple individual test scenario can be documented.
Note that it makes reference to the enterprise’s own procedures manuals.
Task: Restore the NT application server, SV002, after a total hardware crash.
The operating system (system disk) and the application will be restored. This
scenario assumes that the application data is located on a disk array which was
not destroyed. The system partition was backed up by an IBM TSM server the
previous night, and the backup should be consistent.
Prerequisites: Appropriate hardware must be available.
Essential steps: Follow the internal procedure for restore of Windows NT server.
This procedure is described in the internal Systems Restore Procedures on page
xxx (Note: of enterprise’s own procedures manual).
The procedure for each test scenario must be provided in writing — containing all
the required steps, supporting information and suggestions for improvement (for
example, after a test run).
Task: Restore operations at the remote site after the primary site was totally
destroyed by disaster.
Essential steps: Table 4-5 shows an example of the DR testing scenario. The first scheduled step reads: at 05:00 a.m. on Saturday, the operator on duty checks that the weekly TSM backup finished and that all offsite tapes were moved to the backup location.
Review: All steps performed during the DR testing procedure should be recorded. This record becomes the base document for the post-testing review. An example of a format which could be used is shown in Table 4-6.
Utilize every opportunity for testing. Investment in testing is never a waste of time
or money. A DRP is like car insurance. You pay money to the insurance company
(which is a financial commitment) but in the case of an accident you expect that
all damages will be covered by the insurance company. If you didn’t think the
insurance company could pay the damages, you would not pay for their
coverage. Similarly, if your DRP does not work, it needs to be fixed — however
you will not know if it works, and what is wrong with it, until you test it.
If the personnel involved either have no access to the current DRP, or have only an obsolete version of it, the company's ability to respond correctly to a disaster is seriously jeopardized.
This section describes how the maintenance of the DRP could be performed.
The procedures given here are examples only. The actual procedures must be
discussed and agreed to internally and modified in order to meet individual
specific requirements.
The list of members of the Approval Board should be recorded with at least the
information shown in Table 4-7.
At regular intervals (for example, every six months) the Document Owner
contacts the Task Owners to see if changes are necessary. In case of updates,
the Document Owner contacts the Approval Board.
The Approval Board is responsible for the official sign off of all changes and
finally approves the release of the changes or updates made.
After receiving the release from the Approval Board, the Document Owner issues
a new Version Number. The Document Owner records the changes in the
Change History. The Document Owner arranges for the printout of the amended
DRP and initiates the distribution of the DRP to the Distribution List.
As this document contains vital customer information, the Document Owner has
to choose a secure distribution channel. The Document Owner requests the
return of all obsolete versions of the DRP.
The Document Owner collects the obsolete versions of the DRP and initiates
destruction of these documents.
Obsolescence
Scrapping of obsolete documents must be according to the general scrapping
procedure for internal customer documents.
Audits
If no changes or updates have been initiated within the last 12 months, the
Approval Board initiates a review or audit of the DRP in order to ensure the
correctness of the document. The review must include the correctness of the
technical procedures and that all contact names, telephone numbers and
addresses are valid.
General issues
The source-files of the DRP must be protected against uncontrolled modification
and unauthorized access. The hard copies of the DRP should not be made
available to unauthorized persons.
The document must be available to authorized staff at any time in order to ensure
immediate activation of the DRP when necessary.
The members of the Management Team who are responsible for the recovery process should have access to the valid DRP copy at all times, for example, by keeping a copy at an offsite location.
Plan of audits
Table 4-10 records the audit process.
Change History
Table 4-11 can be used to document the changes performed.
Release Protocol
Table 4-12 may be used to keep track of version releases.
Distribution
The DRP will be distributed to the distribution list shown in Table 4-13.
Date
To distribution-list:
Version:
Release Date:
Please replace the complete handbook with this new version and return the
obsolete document to my attention by (insert date).
Signature
TSM provides storage management services for backup, recovery, and disaster
recovery. TSM functionality depends directly on the existence of a well designed
and managed environment.
In the following sections, we discuss the importance of strategic planning and the
importance of mapping design metrics to availability requirements. The role of IT
architecture planning has become central to strategic planning for the enterprise.
To create a manageable enterprise environment, organizations must incorporate
strategic architecture planning into data center procedures. Enterprise hardware
and software standards, architectural design metrics, and the establishment of
business requirement driven policy all make a tremendous difference in creating
a manageable enterprise environment.
(Figure: layers of power protection include UPS systems, power management systems, transformers, backup generators, power grids, and power generation facilities.)
Preventative controls must be documented and integrated into the overall DRP.
General awareness of how these measures are used is very important for
personnel (for example, appropriate response if a fire alarm sounds), and can be
instilled via drills, documentation provided to employees and so on. This will
ensure they know what to do in the case of a real disaster.
Site locations can also be kept virtually private from public awareness by the use
of unmarked buildings and data center locations. Some government and energy
installations, for instance, limit the number of people who have knowledge of
data center locations and access procedures. Although they may seem paranoid, these measures provide an excellent level of protection and security.
Microsoft cluster server is also supported with the TSM application running on
Windows server platforms. Clustering techniques for TSM in the Windows
environment are discussed in Using TSM in a Clustered Windows NT
Environment, SG24-5742, and in 7.8.3, “TSM and Microsoft Cluster Server
(MSCS)” on page 148.
Several varieties of RAID are available for data protection; the most commonly used solutions are RAID-1, RAID-3, and RAID-5, described here:
RAID-1 Otherwise known as disk mirroring, RAID-1 is generally the most expensive and most fault-tolerant method of protecting data. Some RAID-1 implementations also use striping (RAID-0) to improve performance, with the effective result being a RAID 0+1 solution.
Techniques to use these services along with TSM are discussed in detail in 7.9, “TSM and remote disk replication” on page 151.
(Figure: network and SAN building blocks, from network adapter through network switch and network backbone to the network carrier; cascaded FC switch links support distances of up to 500 m on multi-mode fibre and up to 10 km on single-mode fibre.)
Mirrored site: near zero or zero data loss; highly automated takeover on a complex-wide or business-wide basis, using remote disk mirroring, with a mirrored TSM server for backup/recovery operations.
Figure 5-4 Cost versus recovery time for various alternate site architectures
There are obvious cost and recovery time differences among the options. The mirrored site is the most expensive choice, because it ensures virtually 100 percent availability. Cold sites are the least expensive to maintain; however, they require substantial time to acquire and install the necessary equipment. Partially equipped sites, such as warm sites, fall in the middle of the spectrum. Table 5-1 summarizes the criteria that can be employed to determine which type of alternate site meets the organization's business continuity and BIA requirements.
As site costs and architectures are evaluated, primary production site security,
management, operational, and technical controls must map adequately to the
alternative site design.
Multiple party commitments for alternative site provisioning should carefully map
business requirements for availability to a joint DR Planning process.
Identification of mutual risks is key to developing a multi-party Disaster Recovery
Plan and infrastructure. For both recovery sites, the DR sequence for systems
from both organizations needs to be prioritized from a joint perspective. Testing
should be conducted at the partnering sites to evaluate the extra processing
thresholds, compatible system and backup configurations, sufficient
telecommunications and network connections, and compatible security
measures, in addition to the functionality of the recovery strategy.
At an enterprise level, TSM policy must meet overall business requirements for
data availability, data security, and data retention. Enterprise policy standards
can be established and applied to all systems during the policy planning process.
At the systems level, RTO and RPO requirements vary across the enterprise.
Systems classifications and data classifications typically delineate the groups of
systems and data along with their respective RTO/RPO requirements. Data
classification schemes add a necessary layer of sophistication to policy
generation, to effectively streamline backup and recovery operations. Specific
variables for backup type, backup frequency, number of backup copies, archive
variables also can be mapped to these groups of systems and data.
Figure 6-1 shows the overall relationship between planning, policy, and
infrastructure design processes, including TSM definitions which we will discuss
in more detail in this chapter.
(Figure: RTO and RPO requirements drive the design of TSM storage pools, device classes, and the underlying infrastructure.)
Figure 6-2 TSM data movement and policy subjects
TSM capabilities allow a great deal of flexibility, control, and granularity for policy
creation. Due to the abundant choices that exist for policy creation, we can make
general suggestions which can then be implemented in a variety of ways. In an
enterprise environment, flexibility and choice in policy creation directly affects
resource utilization, systems architecture, and overall feasibility of restore
procedures.
(Figure: TSM policy structure. A policy domain contains a policy set; the policy set contains management classes, each with a backup copy group and an archive copy group; client nodes are assigned to the policy domain.)
(Figure 6-4: in each of policy domains A, B, and C, nodes with particular RTO/RPO requirements are bound through copy group rules in management classes A, B, and C to corresponding storage pools.)
Figure 6-4 Relation of RTO and RPO requirements to management classes
Figure 6-4 shows how systems and data, based on criticality, can be bound to
management classes to support specific recovery objectives. Within each
management class, specific copy group rules are applied.
Copy Group
Copy groups belong to management classes. Copy groups control the
generations, storage destination, and expiration of backup and archive files.
The copy group destination parameter specifies a valid primary storage pool
to hold the backup or archive data. For performance and capacity
considerations, the association of copy groups to specific storage pool sizing
and planning procedures is extremely important.
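To make the policy chain concrete, here is a minimal administrative sketch that defines a domain, a management class, and its copy groups. All names (critical_dom, diskpool_gold, tapepool_arch, and so on) are hypothetical, and the destination storage pools must already exist:

   define domain critical_dom description="Mission-critical client systems"
   define policyset critical_dom critical_ps
   define mgmtclass critical_dom critical_ps gold_mc
   define copygroup critical_dom critical_ps gold_mc type=backup destination=diskpool_gold verexists=4 retextra=60
   define copygroup critical_dom critical_ps gold_mc type=archive destination=tapepool_arch retver=365
   assign defmgmtclass critical_dom critical_ps gold_mc
   activate policyset critical_dom critical_ps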
Backup Methods
Another critical variable for client policy is the type of backup method to be
performed. Full, progressive, adaptive subfile, backup sets, or image backups
can be performed for different sets of data.
Figure 6-5 illustrates the connection between storage pools, device classes, and
devices. Each copy group in a management class has a particular storage pool
defined to it where data bound to that management class will be stored. The
device class maps the particular physical devices (disk, drives and libraries)
defined to TSM to the storage pools.
(Figure 6-5: each copy group points to a storage pool; the storage pool points to a device class, which represents the media (disk or tape volumes) and the devices behind them; client data is migrated from disk to tape volumes.)
Storage pools logically map to device classes, which can be designed to meet
enterprise level requirements for performance and capacity, or more granular
performance requirements based on specific RTO objectives for critical data.
Every component in the TSM architecture must be designed to support the most
rigorous RTO requirements in the enterprise.
TSM architectural design should not only factor in current requirements, but also
future requirements for growth and change. A one to three year horizon for
capacity planning should be built into any TSM sizing event. A three to five year
technology plan should also be defined for TSM resources, including platform,
disk, SAN, tape media, and tape devices. Strategic architectural planning and
growth metrics are key.
At a high level, the overall data volume and data type (number and type of files)
shape the basic metric for TSM solutions design. Policies for versioning,
retention, data location, backup frequency, and archiving all directly impact the
overall amount of data managed by TSM. The amount of data and how it is
managed then affects the following elements of a TSM environment:
TSM server (CPU, Memory, I/O)
TSM database and recovery log volumes and layout
TSM disk pool architecture and performance
TSM tape pool architecture and performance
TCP/IP and SAN configurations
Tape media type
While this is a simple example, the same methodology can be applied to critical
systems and critical data in an enterprise environment. Critical data can be
grouped together and associated with storage pool resources and network
connections which meet RTO requirements for recovery.
Once data policy requirements have been defined, the physical devices required
(disk, tape drives and libraries) can be then selected to meet general
performance and capacity requirements.
Figure 6-6 illustrates how management class definitions bind specific data to disk
storage pools contained within the disk device class.
(Figure 6-6: critical data A is bound to disk storage pool A on a disk subsystem, with a logical volume disk copy taken by subsystem copy functions such as PPRC/FlashCopy.)
Figure 6-6 Management class relation to disk storage pools for critical data
While critical data remains in the onsite disk storage pool for fast access and
restore, logical volume copies can be made using subsystem specific copy
functions (such as FlashCopy or SRDF) and additional copies of the data can be
made to tape copy storage pools. In this way, TSM strategically positions critical
data for highly efficient disk based restores if available, while maintaining
additional disk or tape copies in local and remote locations.
Within a disk storage pool definition, the MIGDELAY parameter specifies the minimum number of days that data must remain in the storage pool before it becomes eligible for migration, effectively controlling how long data is retained on disk. When designing individual disk storage pools to meet rigorous RTO requirements, this parameter adds an excellent level of control over storage pool retention. MIGDELAY settings should match the backup frequency of the data concerned.
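As a sketch (the pool names are hypothetical), a disk pool that holds critical data for at least seven days before migrating it to tape could be defined as follows:

   define stgpool diskpool_gold disk description="Critical restore data" highmig=90 lowmig=70 migdelay=7 nextstgpool=tapepool_gold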
High-end disk subsystems also offer the ability to create local or remote mirrored
copies of data contained in disk storage pools. Advanced copy functions include
ESS Flashcopy for local volumes and ESS PPRC for site to site volume
replication or mirroring. These advanced copy functions can complement
contingency requirements for mission critical management class data which
needs to be stored on disk for availability reasons. These methods are discussed in 7.9, “TSM and remote disk replication” on page 151.
Performance and capacity considerations must be made when sizing the disk
device and related disk storage pools. We recommend the use of scalable and
high performance disk subsystems for disk storage pools. The disk subsystem
I/O performance and connectivity must support enterprise level demands for
RTO.
(Figure: device class examples. Management classes A, B, and C (including low volume restores) are mapped to a disk storage pool and two tape storage pools with libraries A and B; for example, four 15 MB/sec tape drives give an aggregate bandwidth of 216 GB/hour, while two give 108 GB/hour.)
Another important element of tape library selection involves strategic planning for
tape media and devices. Enterprise tape considerations should include:
Tape library hardware strategy for technology, performance, and scalability
Tape format/density and tape media road map
Hardware compression (on or off)
Software or TSM compression standards for client data
Tape drive connectivity (SAN or SCSI)
An important factor for tape media management is the volume per cartridge, which inevitably increases with time and innovation. If tape pools contain mixed media types or densities, capacity planning and media management become more complex.
(Figure 6-8: file versions grouped by management class are directed to a disk storage pool and to tape storage pools A, B, and C.)
Figure 6-8 File versions, management class grouping, and storage pools
Table 6-2 shows the basic metrics for network topologies and data transfer rates. We divide each base rate by eight to convert bits to bytes; for example, 10 Mbps is 1.25 MBps. We then multiply by 3600 (the seconds in an hour) and divide by 1000 and by 1,000,000 respectively to give the rates in GB/hour and TB/hour; 10 Mbps therefore corresponds to 4.5 GB/hour. These metrics do not take protocol overhead into consideration, and they assume perfect disk, tape, and application performance. It is therefore more typical to achieve actual throughput of between 60 and 80% of these numbers in real environments. LAN and SAN network connections must support the RTO requirements of each critical TSM client.
For remote operations, such as copy storage pool routines from a local storage pool to a remote tape copy pool, the long-distance network architecture needs to support the DR requirement to restore large volumes of critical data; if the distance network cannot sustain the required transfer rates, recovery objectives will not be met.
The TSM database can (and should) be protected on many levels. The locally
attached disk storage must support mirroring or RAID-5 protection for the TSM
database and recovery logs. The TSM database must be backed up as
frequently as the most frequently backed up data in the enterprise. Specifically,
the RTO and RPO for TSM must be equally or more aggressive than the most
mission critical system in the enterprise.
For instance, if a mission critical file system is backed up every hour, the same
backup policy must be applied to the TSM database. Large TSM environments
often take frequent incremental backups and daily full backups of the TSM server
database. Since TSM client data can only be accessed through the TSM
application, the TSM database must be routinely backed up and safeguarded.
(Figure: example grouping of data into storage pools with their throughput requirements:
  Storage Pool A: accounting data, claims data; 30 GB/hour throughput
  Storage Pool B: customer care data; 15 GB/hour throughput
  Storage Pool C: payroll data, fraud tracking data, claims data; 40 GB/hour throughput
  Storage Pool D: data warehouse core data; 30 GB/hour throughput
  Storage Pool E: ERP application data, NT workstation data)
(Figure: TSM server overview. The server uses its database, recovery log, policy definitions, and schedules to manage primary and copy storage pools; backup and archive clients on LAN-managed and SAN-managed nodes run client schedules, while administrative clients issue commands and administrative schedules.)
TSM server: The TSM server is the program that provides backup, archive,
restore, retrieve, space management (HSM), and administrative services to client
systems (also known as nodes). There can be one or many TSM servers in your
environment to meet data protection needs or balance resources. The TSM
server contains and uses its own dedicated database and recovery log.
TSM recovery log: A log of updates that are about to be written to the database.
The server uses the recovery log as a scratch pad for the database, recording
information about client and server transactions while the actions are being
performed. The log can be used to recover from system and media failures.
(Figure: TSM database transaction flow. The server reads a database page into an in-memory buffer pool for updates, records the change in the recovery log, and writes the database page back to the database from the buffer.)
When a transaction occurs on the TSM server or between it and a TSM client,
the TSM server updates (reads and writes to) the TSM database and recovery
log as required. Backing up files from a client node to the server and storing them
in a storage pool is an example of a transaction. When a transaction occurs, the
server:
1. Reads a database page in the database buffer and updates it.
2. Writes a transaction log record to the recovery log describing the action
occurring and associates it with the database page in case the page needs to
be rolled back during recovery.
3. Writes the database page to the database, releasing it from the buffer pool.
(Figure: database page shadowing. Each database page write is written twice: once to the shadow page area, and then again to the actual location in the database volume where the page belongs.)
During server startup the pages in the shadow area are compared with those in
the real location in the database to determine if any have been partially written. If
partially written pages are detected in the shadow page area, processing simply
continues as before, that is, the real pages are rebuilt and written during
transaction recovery. If the pages in the shadow are intact, but one or more
pages in their real location are partially written, the pages in the shadow page
area are copied over the real page addresses.
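Assuming a TSM Version 5 server, page shadowing is controlled through server options along these lines (the shadow file shown is believed to be the default name):

   * dsmserv.opt
   DBPAGESHADOW      YES
   DBPAGESHADOWFILE  dbpgshdw.bdt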
You can prevent the loss of the database or recovery log due to a hardware
failure on a single drive, by mirroring them on separate physical drives. Mirroring
simultaneously writes the same data to multiple volumes as shown in Figure 7-4.
However, mirroring does not protect against a disaster or a hardware failure that
affects multiple drives or causes the loss of the entire system. While Tivoli
Storage Manager is running, you can dynamically start or stop mirroring and
change the capacity of the database. TSM provides 2-way or 3-way mirroring.
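As an illustrative sketch (the volume paths are hypothetical, the copy volumes must already be formatted, and each copy should sit on a separate physical drive), mirror volumes are added with:

   define dbcopy /tsm/db/vol01 /tsmmirror/db/vol01cpy
   define logcopy /tsm/log/log01 /tsmmirror/log/log01cpy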
(Figure 7-4: the TSM server writes mirror copies of its database and recovery log; one set of copies is kept onsite for recovery and another offsite for disaster recovery.)
Mirroring can be crucial in the recovery process. Consider the following scenario.
Because of a sudden power outage, a partial page write occurs. The recovery
log is corrupted and not completely readable. Without mirroring, recovery
operations cannot complete when the server is restarted. However, if the
recovery log is mirrored and a partial write is detected, a mirror volume can be
used to construct valid images of the missing data.
(Figure 7-5: client data in the TSM server's primary storage pool is backed up to copy storage pools, with one copy kept onsite and one offsite.)
A copy storage pool provides an additional level of protection for client data. It is created by the administrator backing up a primary storage pool (using the BACKUP STGPOOL command) to another storage pool defined as a copy pool. The copy storage pool contains all current versions of all files, active and inactive, exactly as they appear in the primary storage pool. A copy storage pool provides protection from partial and complete failures in a primary storage pool. An example of a partial failure is a single tape in a primary storage pool which is lost or found to be defective. When a client attempts to restore a file which was on this volume, the server automatically uses a copy storage pool volume (if available onsite) containing that file to transparently restore the client's data. If a complete primary storage pool is destroyed, for example in a major disaster, the copy storage pool is used to recreate the primary storage pool. A copy storage pool can use sequential access storage (for example, tape, optical, or FILE device classes), or it can be created remotely on another Tivoli Storage Manager server, thereby providing electronic vaulting.
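A minimal sketch, with hypothetical pool and device class names, of defining a copy storage pool and backing up two primary pools to it:

   define stgpool offsite_copypool lto_class pooltype=copy maxscratch=100 reusedelay=5
   backup stgpool backuppool offsite_copypool
   backup stgpool tapepool offsite_copypool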
Tip: TSM supports an extensive list of tape drives, autoloaders, libraries, and optical devices. For a full list of supported devices, refer to the following Web site:
http://www.tivoli.com/support/storage_mgr/requirements.html
– Volume history file
– Device configuration file with the applicable device information (library,
drive, and device class definitions)
– Database and recovery log setup (the output from detailed queries of your
database and recovery log volumes)
Database backups
TSM can perform full and incremental database backups while the server is
running and available to clients. The backup media can then be stored onsite or
offsite and can be used to recover the database up to the point of the backup.
You can run full or incremental backups as often as needed to ensure that the
database can be restored to an acceptable point-in-time (Figure 7-6).
(Figure 7-6: the TSM server writes database backups while remaining available; together with copy storage pools, these enable recovery of the server.)
A snapshot database backup can also provide disaster protection for the TSM
database. A snapshot backup is a full database backup that does not interrupt
the current full and incremental backup series. Snapshot database tapes can
then be taken offsite for recovery purposes and therefore kept separate from the
normal full and incremental backup tapes.
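For illustration (the device class name is hypothetical), full, incremental, and snapshot database backups are run as follows:

   backup db devclass=lto_class type=full
   backup db devclass=lto_class type=incremental
   backup db devclass=lto_class type=dbsnapshot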
Volume history
Every volume that is used by TSM for storage pools and server database backups is tracked within the server database. This information is very important
because it indicates which volume holds the most recent server database
backup. Volume history information is stored in the database, but during a
database restore, it is not available from there. To perform a restore, therefore,
the server must get the information from the volume history file. The volume
history file can be maintained as a text file by specifying its name and location
with the VOLUMEHISTORY option in the dsmserv.opt file. It is very important to
save the volume history file regularly with the BACKUP VOLHISTORY command. You
can specify multiple volume history files by repeating the VOLUMEHISTORY
stanza in the server options file. If you use DRM, then it will automatically save a
copy of the volume history file in its Disaster Recovery Plan file.
Device configuration
The device configuration file contains information required to read backup data.
This information includes device class, library, drive, and server definitions. This information is stored in the database, but
during a database restore, it is not available from there. To perform a restore,
therefore, the server must get the information from the device configuration file.
The device configuration file can be maintained as a text file by specifying its
name and location with the DEVCONFIG option in the dsmserv.opt file. It is very
important to save your device configuration file regularly with the BACKUP
DEVCONFIG command. You can specify multiple device configuration files by
repeating the DEVCONFIG stanza in the server options file. If you use DRM,
then it will automatically save a copy of the device configuration file in its Disaster
Recovery Plan file.
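A sketch of the corresponding server options and manual backup commands (the file paths are hypothetical):

   * dsmserv.opt
   VOLUMEHISTORY  /tsm/config/volhist.out
   DEVCONFIG      /tsm/config/devcnfg.out

   backup volhistory filenames=/tsm/config/volhist.out
   backup devconfig filenames=/tsm/config/devcnfg.out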
Procedures and recommendations for TSM server and client recovery are
discussed in more detail in:
Chapter 8, “IBM Tivoli Storage Manager and DRM” on page 163
Chapter 10, “Solaris client bare metal recovery” on page 229
Chapter 11, “AIX client bare metal recovery” on page 239
Chapter 12, “Windows 2000 client bare metal recovery” on page 267
Chapter 13, “Linux client bare metal recovery” on page 289.
This information is provided to help you determine which approach you may want
to consider as part of your Disaster Recovery strategy. It is not intended to
exhaustively discuss all backup scenarios.
Each backup method is summarized below by description, usage, and restore options.

Progressive incremental backup
  Description: The standard method of backup used by the Tivoli Storage Manager backup/archive client. After the first, full backup of a client system, incremental backups are done. Incremental backup by date is also available. No additional full backups of a client are required after the first backup.
  Usage: Helps ensure complete, effective, policy-based backup of data. Eliminates the need to retransmit backup data that has not been changed during successive backup operations.
  Restore options: The user can restore just the version of the file that is needed (depending on the retention parameters). Tivoli Storage Manager does not need to restore a base file followed by incremental backups. This means reduced time and fewer tape mounts, as well as less data transmitted over the network.

Selective backup
  Description: Backup of files that are selected by the user, regardless of whether the files have changed since the last backup.
  Usage: Allows users to protect a subset of their data independent of the normal incremental backup process.
  Restore options: The user can restore just the version of the file that is needed. TSM does not need to restore a base file followed by incremental ones. This means reduced time, fewer tape mounts, and less data over the network.

Adaptive subfile backup
  Description: A method that backs up only the parts of a file that have changed since the last backup. The server stores the base file and subsequent subfiles (the changed parts) that depend on the base file. The process works with either the standard progressive incremental backup or with selective backup.
  Usage: Maintains backups of data while minimizing connect time and data transmission for the backup of mobile and remote users. Applicable to clients on Windows systems.
  Restore options: The base file plus a maximum of one subfile is restored to the client.

Journal-based backup
  Description: Aids all types of backups (progressive incremental backup, selective backup, adaptive subfile backup) by basing the backups on a list of changed files. The list is maintained on the client by the journal engine service of the Tivoli Storage Manager backup/archive client.
  Usage: Reduces the amount of time required for backup. The files eligible for backup are known before the backup operation begins. Applicable to clients on Windows NT and Windows 2000 systems.
  Restore options: Journal-based backup has no effect on how files are restored; this depends on the type of backup performed.

Image backup
  Description: Full volume backup. Nondisruptive, online backup is possible for Windows 2000 and Linux clients by using the Tivoli Storage Manager snapshot function.
  Usage: Allows backup of an entire file system or raw volume as a single object. Can be selected by backup-archive clients on UNIX and Windows systems. Used by Windows clients that are using server-free data movement.
  Restore options: The entire image is restored.

Image backup with differential backups
  Description: Full volume backup, which can be followed by subsequent differential backups.
  Usage: Used only for the image backups of NAS file servers, performed by using Tivoli Data Protection for NDMP.
  Restore options: The full image backup plus a maximum of one differential backup are restored.

Archive
  Description: The process creates a copy of files and stores them for a specific time.
  Usage: Use for maintaining copies of vital records for legal or historical purposes. If you frequently create archives for the same data, consider using instant archive (backup sets) instead. Frequent archive operations can create a large amount of metadata in the server database, resulting in increased database growth and decreased performance of server operations such as expiration.
  Restore options: The selected version of the file is retrieved on request.

Instant archive
  Description: The process creates a backup set of the most recent versions of the files for the client, using files already in server storage from earlier backup operations.
  Usage: Use when portability of the recovery media or rapid recovery of a backup-archive client is important. Also use for efficient archiving.
  Restore options: The files are restored directly from the backup set. The backup set resides on media that can be mounted on the client system, for example, CD, tape drive, or file system. The TSM server does not have to be contacted for the restore process, so the network and TSM server are not used.
(Figure 7-7: UNIX, large system, and Windows TSM clients send data over the LAN/WAN (TCP/IP) to the TSM backup server; both the application servers/clients and the TSM server use direct-attached storage.)
Figure 7-7 TSM LAN and WAN backup
(Figure 7-8: in a LAN-free backup, control information flows over the LAN/WAN (TCP/IP) while client data moves over the Fibre Channel SAN directly to the storage devices.)
Figure 7-8 TSM LAN-free backup
(Figure 7-9: in a server-free backup, a third-party extended copy device (datamover) on the SAN moves the server-free client's data directly between disk and tape; the TSM server coordinates the operation over the LAN/WAN.)
Data that has been backed up using server-free data movement can be restored
over a server-free path, over a LAN-free path, or over the LAN itself. The impact
on application servers is now minimized with Server-Free Data Movement. It
reduces both TSM client and server CPU utilization. The use of a SCSI-3
extended copy command causes data to be transferred directly between devices
over the SAN or SCSI bus. The data mover device must support the SCSI-3
EXTENDED COPY command, which conforms to the ANSI T10 SPC-2 standard.
The data mover device can be anywhere in the SAN, but it has to be able to
address the LUNs for both the disk and tape devices it is moving data between.
(Figure 7-10: control flows over the LAN/WAN (TCP/IP) while the split-mirror/point-in-time data copy is taken across the SAN.)
Figure 7-10 TSM split-mirror/point-in-time copy backup
(Figure: UNIX, large system, and Windows application servers with a pre-installed TSM client; the split-mirror or point-in-time copy is backed up over the SAN to the TSM server.)
The Tivoli Data Protection for NDMP product is now available with Tivoli Storage
Manager (TSM) Version 5 Extended Edition. It provides backup and recovery
support on TSM servers for NAS file servers from Network Appliance. NAS file
servers often require a unique approach to providing backup and recovery
services, because these file servers are not typically intended to run third-party
software. The NAS file server does not require installation of Tivoli Storage
Manager software. Instead, the TSM server uses NDMP to connect to the NAS
file server to initiate, control, and monitor a file system backup or restore
operation, as shown in Figure 7-12. Tivoli Data Protection for NDMP utilizes the
Network Data Management Protocol (NDMP) to communicate with and provide
backup and recovery services for NetApp NAS file servers. NDMP is an
industry-standard protocol that allows a network storage-management
application to control the backup and recovery of an NDMP-compliant file server
without installing third-party software on that server. The implementation of the
NDMP server protocol enables the NAS file servers to be backup-ready and
enables higher-performance backup to tape devices without moving the data
over the LAN.
(Figure 7-12: the TSM server sends NDMP backup control commands to the NAS file server over the LAN/WAN (TCP/IP); the data flows from the NAS file server to its direct-attached tape over the SAN, without crossing the LAN.)
Figure 7-12 TSM and NDMP backup
With image backup, the TSM server does not track individual files in the file
system image. File system images are tracked as individual objects and the
management class policy will be applied to the file system image as a whole. An
image backup provides the following benefits:
– It can provide a quicker backup and restore than a file-by-file backup, as there is no overhead involved in creating individual files.
– It conserves resources on the server during backups, since only one entry is required for the image.
– It provides a point-in-time picture of your file system, which may be useful if your enterprise needs to recall that information.
– It can restore a corrupt file system or raw logical volume, with data restored to the same state it was in when the last logical volume backup was performed.
Figure 7-13 illustrates the process for image backup with TSM.
On the Windows 2000 client platform a Logical Volume Storage Agent (LVSA)
has been introduced which is capable of taking a snapshot of the volume while it
is online. Optionally only occupied blocks can be copied. If the snapshot option is
used (rather than static) then any blocks which change during the backup
process are first kept unaltered in an Original Block File. In this way the client is
able to send a consistent image of the volume as it was at the start of the
snapshot process to the TSM server.
(Figure 7-13: the TSM client sends an image of a volume or mount point (for example, x:\) over the LAN/WAN to the TSM backup server.)
Figure 7-13 TSM image backup
This section has provided an overview of many of the common backup methods
supported by TSM. These methods can be integrated with DR strategies being
considered in your environment.
One of the key features of Tivoli Storage Manager and Tivoli Disaster Recovery Manager is the ability to track media in every state they can be in, such as onsite, in transit, or in a vault. Because of the criticality of data in the production environment, controls are needed to make sure that all previously backed up data can be found and restored in a reasonable amount of time.
(Figure: the three functions of DRM. 1. Disaster Recovery planning: administrators run the PREPARE command against the backup environment. 2. Offsite media management: each volume is tracked by DRM as it moves between the onsite location and the vault, and back for retrieval. 3. Automated recovery of the TSM server: the TSM database and storage pools are restored from the tracked media.)
DRM uses the PREPARE command to generate a plan file that will contain critical
information needed for recovery. Information in the plan is arranged in stanzas,
these can be considered to be somewhat like headers. For example the stanza
PLANFILE.DESCRIPTION shown in Example 7-1 provides summary information
about the plan file as a whole.
end PLANFILE.DESCRIPTION
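As a sketch (the plan file prefix is hypothetical), the plan is generated and placed with:

   set drmplanprefix /tsm/drm/plan
   prepare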
A detailed description, recovery scenario, and recovery plan built with DRM is
given in Chapter 8, “IBM Tivoli Storage Manager and DRM” on page 163. Also,
recommendations and examples of using DRM to store client machine
information in the DRM plan file for use during a client disaster recovery are
given in Chapter 12, “Windows 2000 client bare metal recovery” on page 267
and Chapter 13, “Linux client bare metal recovery” on page 289.
DRM focuses first on recovery of the TSM server itself, because it is from the server that clients can continue to back up their data and recover it if required. For the clients, DRM allows the machine information needed
to help recover the TSM clients to be stored in the TSM server database. The
type of information stored includes:
TSM client machine location, machine characteristics, and recovery
instructions.
Business priorities associated with the TSM client machines.
Description, location, and volume/diskette/CD labels of TSM client boot
media.
With this information stored in the DRM plan file you can then use appropriate
operating system CDs or tape images to perform a bare metal recovery (BMR) of
your client system. Finally, the TSM client on that machine can be reinstalled
which allows for restoration of client data from the TSM server. Strategies for
bare metal recovery and recovery of data on various client platforms are
discussed in detail in:
Chapter 10, “Solaris client bare metal recovery” on page 229
Chapter 11, “AIX client bare metal recovery” on page 239
Chapter 12, “Windows 2000 client bare metal recovery” on page 267
Chapter 13, “Linux client bare metal recovery” on page 289
(Figure: enterprise administration. A configuration manager at headquarters distributes the enterprise configuration to managed servers and routes commands to them.)
You define the servers that you want the configuration manager to manage or
communicate with, and you set up communications among the servers.
On each server that is to receive the configuration information, you identify the
server as a managed server by defining a subscription to one or more profiles
owned by the configuration manager.
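A sketch of this setup (server names, addresses, passwords, and the profile name are all hypothetical):

   /* On the configuration manager */
   set configmanager on
   define profile standard_cfg
   define profassociation standard_cfg domains=critical_dom

   /* On each managed server */
   define server hq_mgr hladdress=hq.example.com lladdress=1500 serverpassword=secret
   define subscription standard_cfg server=hq_mgr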
When you connect to the configuration manager via a Web browser, you are
presented with the enterprise console. From the enterprise console you can
perform tasks on the configuration manager and on one or more of the managed
servers. You can also connect to another server to perform tasks directly on that
server. As long as you are registered with the same administrator ID and
password, you can do this work on many servers without having to log on each
time.
From the command line of the administrative Web interface or from the
command-line administrative client, you can also route commands to other
servers. The other servers must be defined to the server to which you are
connected. You must also be registered on the other servers as an administrator
with the administrative authority that is required for the command. Command
routing enables an administrator to send commands for processing to one or
more servers at the same time. The output is collected and displayed at the
server that issued the routed commands. A system administrator can configure
and monitor many different servers from a central server by using command
routing. To make routing commands easier, you can define a server group that
has servers as members. Commands that you route to a server group are sent to
all servers in the group.
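For example (the group and server names are hypothetical), a server group can be defined and a command routed to all of its members:

   define servergroup branch_grp
   define grpmember branch_grp server_b,server_c
   branch_grp: query db format=detailed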
One or more servers can send server events and events from their own clients to
another server for logging. The sending server receives the enabled events and
routes them to a designated event server. This is done by a receiver that Tivoli
Storage Manager provides. At the event server, an administrator can enable one
or more receivers for the events being routed from other servers.
The source server is a client of the target server, and the data for the source
server is managed only by the source server. In other words, the source server
controls the expiration and deletion of the files that comprise the virtual volumes
on the target server. At the target server, the virtual volumes from the source
server are seen as archive data. The relationship between the source and target
TSM servers is illustrated in Figure 7-16.
(Figure 7-16: server-to-server virtual volumes. The local source server sends primary storage pool data, copy storage pool data, database backups, and the Disaster Recovery Plan file over the LAN/WAN to the remote target server, where the virtual volumes are stored as archive objects.)
To use virtual volumes, the source server needs to define one or more device
classes of TYPE=SERVER, using the DEFINE DEVCLASS command. The device
class definition indicates on which remote or target server the volumes will be
created. Having defined the device class, it can be used for primary or copy
storage pools, database backups and other virtual volume functions.
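An illustrative sketch (names, addresses, and capacities are hypothetical); note that the DEFINE SERVER on the source uses the password of the node registered on the target:

   /* On the target server */
   register node source_srv srcpass type=server

   /* On the source server */
   define server target_srv hladdress=target.example.com lladdress=1500 password=srcpass
   define devclass remote_class devtype=server servername=target_srv maxcapacity=500m
   define stgpool remote_copypool remote_class pooltype=copy maxscratch=50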
All data destined for virtual volumes is sent to the target server, using virtual
volumes rather than direct attached storage devices. For example, if a client is
backing up data which is bound to a backup copy group which is using a virtual
volume primary storage pool, this data will be sent to the target server. If a client
needs to restore the data, the source server gets the data back from the target
server. A TSM client can always perform the same granularity of restore, retrieve
or recall operation, whether the data is stored on a local TSM server, or on a
target server using server-to-server communication. That is, remote storage
pools (using server-to-server communication) are transparent to the client. The
only requirement is that the TCP/IP communication link between the source and
target server must be working correctly.
Note: Source server objects such as database and storage pool backups are
stored on the target server as archived data. Therefore, the target server
cannot directly restore these objects in the event of a disaster at the source
server site. In the event of a disaster at the source server site, the source
server should be re-installed (likely at an alternate location) and then objects
originally stored on the target server can be restored over the network, using
the same server-to-server communication.
The TSM Disaster Recovery Manager (now included with TSM Extended Edition
Version 5) provides full support of server-to-server virtual volumes for database
and storage pool backup. This function is required on a source server but not on
the target server.
For disaster recovery, server-to-server virtual volumes can be used to store the
Disaster Recovery Plan file remotely. In this strategy, the source server creates
the Disaster Recovery Manager plan files, then stores the files on a remote target
server. You can display information about recovery plan files from the server that
created the files (the source server) or from the server on which the files are
stored (the target server). To display plan file information on the target server you
can issue the QUERY RPFILE command, specifying the source server node name
that prepared the plan. You can easily display a list of all recovery plan files that
have been saved on a target server.
Restoring data through virtual volumes, which must come back over the network from the target server, is clearly more time consuming than a simple copy storage pool operation without server-to-server communication.
Whether you will configure your system to include clusters depends on your
business needs. A cluster can provide system level high availability to ensure a
TSM server or client can continue normal backup and restore processes without
significant disruption for users and administrators. In addition to assuring the
right type of hardware and the applicable software, varying failover patterns
between cluster nodes exist and play different roles, for example, hot-standby
versus concurrent cluster operation. This section gives an overview of TSM server and client clustering support and configuration using IBM High Availability Cluster Multi-Processing (HACMP) and Microsoft Cluster Server (MSCS).
The TSM server on AIX has been officially supported with HACMP since V4.2. With V5.1 of TSM, several HACMP scripts are provided with the TSM server
server fileset. These scripts can then be customized to suit the local
environment. When failover occurs, HACMP calls the TSM startserver script on
the standby node. The script verifies the devices, breaks any SCSI reserves, and starts the server. On fallback, the stopserver script runs on the standby node,
which causes the TSM server to halt. Then the startserver script runs on the
production node. HACMP handles taking over the TCP/IP address and mounting
the shared file systems on the standby node or production node, as appropriate.
By default, the startserver script will not start the Tivoli Storage Manager server
unless all the devices in the VerifyDevice statements can be made available.
However, you can modify the startserver script to start the server even if no
devices can be made available.
Both failover and fallback act as if a Tivoli Storage Manager server has crashed
or halted and was then restarted. Any transactions that were in-flight at the time
are rolled back, and all completed transactions are still complete. Tivoli Storage
Manager clients see this as a communications failure and try to re-establish
connection based on their COMMRESTARTDURATION and
COMMRESTARTINTERVAL settings. The backup-archive client can usually
restart from the last committed transaction. The clients and agents will behave as
they normally do if the server was halted and restarted while they were
connected. The only difference is that the server is physically restarted on
different hardware.
(Figure 7-18: two HACMP cluster nodes, machine-a in production and machine-b for failover, linked by a heartbeat; when machine-a fails, the TSM client workload moves to machine-b and data continues to flow to the TSM server over the LAN/WAN.)
Figure 7-18 HACMP and TSM client high availability configuration
The CLUSTERNODE option in the AIX client dsm.sys file determines if you want
the TSM client to back up cluster resources and participate in cluster failover for
high availability.
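A sketch of the relevant dsm.sys stanza (the server name, address, and node name are hypothetical):

   SErvername         tsm_prod
      COMMMethod       TCPip
      TCPServeraddress tsmsrv.example.com
      NODename         hacmp_cluster1
      PASSWORDAccess   generate
      CLUSTERnode      yes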
If a failover occurs during a user initiated (that is, non-scheduled) client session,
the TSM client starts on the node that is handling the takeover. This allows it to
process scheduled events and provide Web client access. You can install the
TSM client locally on each node of an HACMP environment. You can also install
and configure the TSM Scheduler Service for each cluster node to manage all
local disks and each cluster group containing physical disk resources.
HACMP support for Hierarchical Storage Management (HSM) clients on AIX ensures that HSM-managed filesystems continue to operate through HACMP node failover and fallback.
For example, in the MSCS failover environment shown in Figure 7-19, a clustered
TSM server called TSMSERVER1 runs on node A and a clustered TSM server
called TSMSERVER2 runs on node B. Clients connect to the TSM server
TSMSERVER1 and the TSM server TSMSERVER2 without knowing which node
currently hosts their server. The MSCS concept of a virtual server ensures that
the server’s location is transparent to client applications. To the client, it appears
that the TSM server is running on a virtual server called TSMSERVER1. When
one of the software or hardware resources fails, failover occurs. Resources (for
example, applications, disks, or an IP address) migrate from the failed node to
the remaining node. The remaining node takes over the TSM server resource
group, restarts the TSM service, and provides access to administrators and
clients. If node A fails, node B assumes the role of running TSMSERVER1. To a
client, it is exactly as if node A were turned off and immediately turned back on
again.
(Figure 7-19: an MSCS cluster with a cluster interconnect hosts the clustered TSM servers TSMSERVER1 and TSMSERVER2; when the node running TSMSERVER1 fails, the virtual server fails over to the surviving node.)
Clients experience the loss of all connections to TSMSERVER1 and all active
transactions are rolled back to the client. Clients must reconnect to
TSMSERVER1 after this occurs, which is normally handled as an automatic
attempt to reconnect by the TSM client. The location of TSMSERVER1 is
transparent to the client. A node can host physical or logical units, referred to as
resources. Administrators organize these cluster resources into functional units
called groups and assign these groups to individual nodes. If a node fails, the
server cluster transfers the groups that were being hosted by the node to other
nodes in the cluster. This transfer process is called failover. The reverse process,
failback, occurs when the failed node becomes active again and the groups that
were failed over to the other nodes are transferred back to the original node.
Two failover configurations are supported with MSCS and TSM: active/passive
and active/active. In the active/passive configuration you create one instance of a
TSM server that can run on either node. One system runs actively as the
production TSM server, while the other system sits passively as an online (hot)
backup. In the active/active configuration the cluster runs two independent
instances of a TSM server, one on each server. In the event of a system failure,
the server on the failed instance transfers to the surviving instance, so that it is
running both instances. Even if both instances are running on the same physical
server, users believe they are accessing a separate server.
Running TSM in an MSCS cluster requires each TSM server instance to have a private set of disk resources.
Although nodes can share disk resources, only one node can actively control a
disk at a time. TCP/IP is used as the communications method in a MSCS
environment with TSM.
MSCS does not support the failover of tape devices. However, TSM can handle this type of failover pattern with the correct setup. TSM uses a shared SCSI
bus for the tape devices. Each node (two only) involved in the tape failover must
contain an additional SCSI adapter card. The tape devices (library and drives)
are connected to the shared bus. When failover occurs, the TSM server issues a
SCSI bus reset during initialization. In a failover situation, the bus reset is
expected to clear any SCSI bus reserves held on the tape devices. This allows
the TSM server to acquire the devices after the failover.
(Figure 7-20: two MSCS cluster nodes, node-1 in production and node-2 for failover, joined by a cluster interconnect; when node-1 fails, the TSM client workload moves to node-2 and data continues to flow to the TSM server over the LAN/WAN.)
Figure 7-20 MSCS and TSM client high availability configuration
In this example, the cluster contains two nodes: node-1 and node-2, and two
cluster groups containing physical disk resources. In this case, an instance of the
TSM Backup-Archive Scheduler Service should be installed for each node
node-1, node-2, and physical disk resources. This ensures that proper resources
are available to the Backup-Archive client when disks move (or fail) between
cluster nodes. The CLUSTERNODE option in the client option file ensures that
TSM manages backup data logically, regardless of which cluster node backs up a
cluster disk resource.
Hardware disk replication technologies, such as IBM’s Peer-to-Peer Remote
Copy (PPRC) or EMC’s SRDF can be used to provide real-time (synchronous or
asynchronous) mirroring of the Tivoli Storage Manager database, log, or storage
pools to a remote site, as illustrated in Figure 7-21. In the event of a disaster, the
target volume at Datacenter #2 can be suspended and the volume pair
terminated, returning the target volume to the simplex state, in which the secondary volume becomes accessible to the host at the remote site. This online
copy of the TSM database at the remote site can be used to resume the TSM
environment. Later the volumes could be re-synched to re-establish a remote
mirror pair.
(Figure 7-21: TSM Server-A at Datacenter #1 and TSM Server-B at Datacenter #2, each with local SAN-attached storage and LAN/WAN (TCP/IP) client connections, are linked by a long-distance SAN using DWDM, or by a WAN/IP-based router or channel extender.)
If remote disk replication is used for mirroring of the TSM database and storage
pools, the TSM server could be recovered very quickly without any loss of client
data. A peer-to-peer configuration could be used to balance the load of TSM
services in the enterprise and provide data protection and rapid recovery for a
failure at either site. Various configurations with remote disk replication can exist.
For example, if only the TSM database and logs are mirrored remotely, recovery
of client data can begin once TSM copy pools are restored from tape. Electronic
tape vaulting (see Figure 7-22) can be used along with remote disk replication to improve recovery time and recovery point objectives.
Attention: TSM database updates may be broken into multiple I/Os at the operating system level. Therefore, it is necessary to replicate all Tivoli Storage Manager-managed mirrors of the Tivoli Storage Manager database to the remote site. Replicating all database mirror copies will protect against a broken chain of I/Os.
Optionally, data can be copied to tape pools at the remote site (Datacenter #2) and then migrated to the tape storage pools at the primary site (Datacenter #1), as shown in Figure 7-22.
(Figure 7-22: electronic tape vaulting. 1. A remote copy of the primary storage pool is written from TSM Server-A at Datacenter #1 to Datacenter #2 over the long-distance SAN (DWDM, or a WAN/IP-based router or channel extender). 2. The data is then migrated to tape storage pools.)
Table 7-2 summarizes some of these. The use of these various technologies also
may depend on a particular vendor’s replication or vaulting solution. For example,
the IBM TotalStorage Enterprise Storage Server uses PPRC to achieve data
replication. PPRC is supported via ESCON links which can be further extended
via DWDM or WAN channel extension.
multi-mode fibre and is the ideal choice for shorter distances (less than 500 m
from transmitter to receiver).
WAN and IP based channel extenders typically use telecommunication lines for
data transfer and therefore enable application and recovery sites to be located
longer distances apart. The use of WAN and IP channel extenders provides the
separation for disaster recovery purposes and avoids some of the barriers
imposed when customers do not have a “right of way” to lay their fibre cable.
WAN and IP channel extenders generally compress the data before sending it
over the transport network; however, the achievable compression ratio depends
on the application characteristics and the distance.
Network attached storage (NAS) and iSCSI solutions are beginning to offer low
cost IP based storage, such as the IBM TotalStorage IP 200i. Copies of TSM
storage pools and the TSM database can be stored at a remote site using IP
based storage to offer a low-cost implementation while utilizing existing
infrastructure. Configurations can include TSM clients on iSCSI-attached
storage backing up to a TSM server, or TSM servers using iSCSI based storage
for storage pools.
For a detailed overview of technologies, products, costs and best practices with
distance solutions we recommend you review Introduction to SAN Distance
Solutions, SG24-6408.
With collocation disabled for a copy storage pool, typically there will be only a few
partially filled volumes after storage pool backups to the copy storage pool are
complete. Consider carefully before using collocation for copy storage pools.
Even if you use collocation for your primary storage pools, you may want to
disable collocation for copy storage pools. Or, you may want to restrict collocation
on copy storage pools to certain critical clients, as identified by the Business
Impact Analysis.
When an offsite volume is reclaimed, the files on the volume are rewritten to
another copy storage pool volume which is onsite. The TSM server copies valid
files contained on the offsite volumes being reclaimed, from the original files in
the primary storage pools. In this way, the server can reclaim offsite copy storage
pool volumes without having to recall and mount these volumes. Logically, these
files are moved back to the onsite location. The new volume should be moved
offsite as soon as possible. However the files have not been physically deleted
from the original offsite volume. In the event of a disaster occurring before the
newly written copy storage pool volume has been taken offsite, these files could
still be recovered from the offsite volume, provided that it has not already been
reused and the database backup that you use for recovery references the files on
the offsite volume. The server reclaims an offsite volume which has reached the
reclamation threshold as follows:
1. The server determines which files on the volume are still valid.
2. The server obtains these valid files from a primary storage pool, or if
necessary, from an onsite volume of a copy storage pool.
3. The server writes the files to one or more volumes in the copy storage pool
and updates the database. If a file is an aggregate file with unused space, the
unused space is removed during this process.
4. A message is issued indicating that the offsite volume was reclaimed.
5. The newly written volumes are then marked to be sent offsite, and after this
has occurred, the reclaimed volume can be returned to an onsite scratch
pool.
Volumes with the access value of offsite are eligible for reclamation if the amount
of empty space on a volume exceeds the reclamation threshold for the copy
storage pool. The default reclamation threshold for copy storage pools is 100%,
which means that reclamation is not performed.
If you plan to make daily storage pool backups to a copy storage pool, then mark
all new volumes in the copy storage pool as offsite and send them to the offsite
storage location. This strategy works well, with one consideration that applies if
you are using automatic reclamation (that is, if the reclamation threshold is less
than 100%). Each
day’s storage pool backups will create a number of new copy storage pool
volumes, the last one being only partially filled. If the percentage of empty space
on this partially filled volume is higher than the reclaim percentage, this volume
becomes eligible for reclamation as soon as you mark it offsite. The reclamation
process would cause a new volume to be created with the same files on it. The
volume you take offsite would then be empty according to the TSM database. If
you do not recognize what is happening, you could perpetuate this process by
marking the new partially filled volume offsite.
If you send copy storage pool volumes offsite, we recommend that you control
copy storage pool reclamation by using the default value of 100%. This turns
reclamation off for the copy storage pool. You can start reclamation processing at
desired times by changing the reclamation threshold for the storage pool.
Depending on your data expiration patterns, you may not need to do reclamation
of offsite volumes each day. You may choose to perform offsite reclamation on a
less frequent basis. For example, suppose you send copy storage pool volumes
to and from your offsite storage location once a week. You can run reclamation
for the copy storage pool weekly, so that as offsite volumes become empty they
are sent back for reuse.
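As a sketch of this approach, assuming a copy storage pool named OFFSITE_COPY, reclamation could be turned on for the weekly run and then turned off again:

update stgpool offsite_copy reclaim=60
(wait for reclamation processing to complete)
update stgpool offsite_copy reclaim=100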
When you do perform reclamation for offsite volumes, the following sequence is
recommended:
1. Back up your primary storage pools to the copy storage pools.
2. Turn on reclamation for the copy storage pools by lowering the reclamation
threshold.
3. When reclamation processing completes, turn reclamation off again by
resetting the threshold to 100%.
4. Mark the new copy storage pool volumes as offsite and send them to the
offsite storage location.
This sequence ensures that the files on the new copy storage pool volumes are
sent offsite, and are not inadvertently kept onsite because of reclamation.
Part 2. Implementation procedures and strategies
Having covered planning strategies, this part of the book provides practical
procedures for protecting and restoring TSM servers and clients. Disaster
Recovery Manager (DRM) is discussed in depth, including how to set it up, how
to maintain the plan, and how to recover the TSM server using the plan. Next,
procedures for bare metal recovery on popular operating systems are described,
including Windows 2000, Linux, Solaris and AIX. Finally, we draw together many
of the threads in this book to present some complete disaster protection
implementation ideas and case studies.
How do you choose one TSM server platform over another? Sometimes there
will be no choice, because of system availability, historical reasons, or corporate
policy.
However if a new server is to be commissioned, then there are certain factors to
consider. With only minor differences, a TSM server provides the same
functionality on every platform at the same version and release level. From the
TSM client perspective, there is no major difference between any TSM server —
they will provide the same services, even for a bare metal recovery. However
there are some differences dictated by operating system platform which affect
overall costs, administration, and operations. Every platform has different
procedures for bare metal recovery or re-installation of the operating system.
Some systems are more scalable or more versatile in terms of peripheral
support. If you want to use clustered TSM systems or a hot standby system,
the question of cost can be important. Table 8-1 summarizes these
considerations.
The operation of the TSM server itself is fairly consistent across all platforms —
the administrative commands and interfaces are the same. The largest
differences are in attaching and installing devices, and in managing disk volumes
for database and log volumes, and disk storage pools. If there are already
Supported devices
We have already frequently mentioned the enormous growth in the amount of
enterprise data. Storage devices and technologies are continually being
upgraded to keep abreast of this growth. A wide variety of supported devices is
available on the Windows and UNIX platforms, including disk drives, disk
arrays, tape drives, optical drives and automated tape libraries. OS/400 and
z/OS tend to have a more restricted range. For a list of supported devices on
each TSM server platform, see the Web site:
http://www.tivoli.com/support/storage_mgr/devices/all.html
It is also important to carefully check support for SAN-attached devices when
using functions such as tape library sharing, LAN-free backup, and server-free
backup.
In general, most new sites will choose either Windows or one of the UNIX
platforms, as these are somewhat easier to administer, and skilled staff are
readily available. Companies with significant OS/400 investment and skill will be
more comfortable with an OS/400 platform. If you have skilled staff in either
Windows or particular UNIX-variants, choose those platforms. For sites with little
or no commitment and skills to other platforms, choose AIX. It is easy to install or
restore, and robust enough to handle most implementations while remaining
relatively cost effective. AIX-based TSM servers are very well supported by IBM
in the operating system, TSM and DR Planning. The AIX platform scales from
small to very large TSM systems.
The main reasons why we suggest the biggest, fastest, and most automated
tape library are:
Constant growth in the quantity of data. Forecast data capacity is often
reached in half the projected time. Retention requirements for data can also
increase, requiring more media storage.
As data volumes increase, backup windows are shortening, and the
availability of dedicated backup windows (data offline) can even be zero in
today’s 24x7, mission-critical environments.
Automation helps reduce overall costs, increases reliability (less human error)
and makes the system as a whole faster.
Creating copy storage pools for offsite storage requires at least two drives in a
library. With multiple drives, housekeeping operations such as reclamation are
also much easier. The TSM backup client and many of the API applications (for
example, DR backup for SAP, RMAN for Oracle) are able to automatically and
simultaneously stream the data to multiple drives, providing efficient utilization of
multi-drive libraries. For better data and system availability, we therefore usually
recommend at least three drives in a library. This provides better
availability in the event of failure of one drive and allows more concurrent TSM
operations to occur. Detailed TSM planning will determine the optimum number
required.
With large amounts of data, 4mm, 8mm, and QIC tapes are a less suitable fit,
because they do not have large capacity. They are also becoming increasingly
less common. DLT is a quite common tape system and can store large amounts
of data, but it also has a number of restrictions. These tape formats should only
be recommended for smaller systems.
LTO drives and tape libraries (IBM 358x product line) are a good choice if there
is no requirement to attach to a mainframe. This tape technology offers high
performance and reliability and can connect to multiple operating systems for
library sharing. The IBM 3584 LTO library provides coexistence with DLT drives,
therefore protecting a present investment in DLT technology.
The IBM 3590 tape system is a high-end system usable in any environment,
including the mainframe. The 3590 technology offers high speed, capacity, and
reliability.
SANs have altered the nature of distributed systems storage. SANs create new
methods of attaching storage to processors by repositioning storage devices
onto a shared network and by providing new, improved data access methods to
this storage. These new connectivity and access options promise improvements
in both data availability and performance. One operational area benefiting from
the development of SAN technology is data backup/restore and archive/retrieve.
For backup/restore operations, SANs enable several new capabilities.
TSM has progressively included SAN exploitation in the product, starting with
tape library sharing, first made available in V3.7. IBM TSM Extended Edition
V5.1 now includes the base tape library sharing, plus LAN-free backup to both
tape and disk (using Tivoli SANergy) and server-free backup. LAN-free and
server-free backup allow the data traffic to be off-loaded to the SAN, instead of
moving the data through file servers over the LAN. This reduces LAN and server
overhead, minimizing the impact of backups on the efficiency of the network.
The recovery log is used by the server to keep a record of all changes to the
database. When a change occurs, the recovery log is updated with some
transaction information before the database is updated. This enables
uncommitted transactions to be rolled back during recovery so the database
remains consistent.
The TSM recovery log also consists of one or more recovery log volumes on a
disk. Access to the recovery log is predominately write-oriented, with the writes
and the few reads clustered together. The writes are done in moving cursor
format which does not lend itself to multiple volume organization. Therefore,
fewer recovery log volumes are normally required. Mirroring of the recovery log is
highly recommended, even if database mirroring is not done (usually for reasons
of the cost of disks). Recovery log volumes can also be mirrored within TSM
(2-way or 3-way). Figure 8-3 shows a TSM recovery log configured with 3-way
mirroring on the Windows platform. Each mirror copy is on a separate physical
volume.
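Figure 8-3 is not reproduced here. As a sketch of how such mirrors could be defined from the administrative command line, assuming the log volume path shown later in this chapter and hypothetical mirror paths, each DEFINE LOGCOPY command adds one mirror copy of an existing recovery log volume:

define logcopy c:\tsmdb\tsmlog01.db d:\tsmmirror1\tsmlog01.cpy
define logcopy c:\tsmdb\tsmlog01.db e:\tsmmirror2\tsmlog01.cpy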
The TSM server can use one of two log modes for the recovery log: NORMAL or
ROLLFORWARD. The log mode determines how long TSM saves records in the
recovery log and the kind of database recovery you can use. When log mode is
set to NORMAL, TSM saves only those records needed to restore the database
to the point of the last backup. TSM deletes any unnecessary records from the
recovery log. Changes made to the database since the last backup cannot be
recovered. In NORMAL log mode, you may need less space for the recovery log,
because TSM does not keep records already committed to the database. In the
ROLLFORWARD log mode TSM saves all recovery log records that contain
changes made to the database since the last time it was backed up. Log records
are deleted only after a successful database backup. In this mode the database
can be restored to the most current state (rollforward recovery) after loading the
most current database backup series. Using ROLLFORWARD log mode may
require a significant amount of space to record all activity.
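As a sketch, assuming the device class CLASS1 used elsewhere in this book, switching to roll-forward mode and defining a database backup trigger (so that a full log does not halt the server) might look like:

set logmode rollforward
define dbbackuptrigger devclass=class1 logfullpct=80 numincremental=6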
A full backup takes longer to run than an incremental backup, because it copies
the entire database. However, with the full backup you can more quickly recover
your TSM server because only one set of volumes needs to be loaded to restore
the entire database. Incremental backup takes less time to run because it copies
only database pages that have changed since the last time the database was
backed up. However, this increases the restore time because you must first
restore the last full backup followed by all the incremental backups.
Snapshot backup is a full backup that does not interrupt the full plus incremental
backup series. Snapshot backups are usually shipped offsite (with the assistance
of DRM) while regular full and incremental backups are kept onsite.
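A sketch of the three database backup types from the administrative command line, again assuming device class CLASS1:

backup db devclass=class1 type=full
backup db devclass=class1 type=incremental
backup db devclass=class1 type=dbsnapshot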
NORMAL log mode: requires a smaller recovery log, but the database cannot be
restored to its most current state.
ROLLFORWARD log mode: the database can be restored to its most current
state, and a single database volume can be restored.
Using the parameters Migration Delay and Cache Migrated Files can be useful
for some DR cases, particularly if the disk storage pool is located on a storage
system with a remote copy. Important files can therefore be kept in a disk storage
pool for quick access.
The TSM server has the ability to collocate client data on tape volumes. When
files are moved to a collocated storage pool, TSM ensures that the files for a
specific client are written to the same set of tapes. This can limit the number of
tapes that must be mounted when restoring that client’s system. Collocation can
be done at the client level or by individual filespace. When you are deciding
whether or not to enable collocation, keep in mind:
Non-collocation increases performance on backups because TSM does not
have to select specific tapes.
Collocation increases performance on restores because client data is
confined to their own dedicated set of tapes. Therefore there is less necessity
to “skip” data not needed for the restore.
For lower Tier disaster solutions, we recommend that you maintain two copy
storage pools — one kept onsite, and one taken offsite. The onsite copy storage
pool will provide availability in the case of media failure in primary storage pool.
The offsite copy storage pool protects data in the case of a disaster. Alternatively,
a SAN-connected tape library in a remote location (using electronic tape
vaulting) reduces the need for two copy pools, because the copy pool on this
library combines the benefits of the onsite and offsite storage pools. Naturally,
the performance of restoring data from an offsite SAN-attached tape library
should be carefully benchmarked to ensure it will meet the RTO.
Given that one to many primary storage pools can be configured in TSM, we
recommend that you back up all storage pools in each single hierarchy to the
same copy storage pool. This can help recovery in the case of disaster. More
than one storage pool hierarchy can be backed up to the same copy storage
pool.
In the event of a disaster, the TSM database must first be updated on the standby
TSM server; basically, this operation is a DSMSERV RESTORE DB operation
(see the sketch after the requirements list below). In order not to interfere with
the database backup series of the primary TSM server, a point-in-time restore
should be done with a database snapshot backup. Unless a current volume
history file with this database snapshot backup entry is available, a restore
without a volume history file will have to be done. The database snapshot backup
should be done immediately after the primary TSM server database backup.
Requirements
Same operating system version and PTF level.
Same TSM version and PTF level.
Access to tape hardware compatible with the primary server tape format.
Equal or greater database and storage pool volume capacity compared to the
primary TSM server.
The TCP/IP address of the standby server must not be the same as that of the
primary server while the primary is active, but the standby must be able to
assume the primary server's TCP/IP address in a disaster; alternatively, change
the TCPSERVERADDRESS parameter in the client options file on all TSM
clients.
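As a minimal sketch of the restore operation referred to above, a point-in-time restore from a snapshot volume without a volume history file might look like this (the device class and volume name are assumed):

dsmserv restore db devclass=class1 volumenames=ABA927L1 commit=yes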
The next two examples show a hot standby TSM server where there is automatic
synchronization of the server database using PPRC or other remote mirroring
technique. First, Figure 8-7, which shows a remote TSM server running as a hot
standby. The TSM database volumes, recovery log, disk storage pools, and
configuration files reside on a mirrored disk array. This scenario allows the
standby TSM server to assume the primary TSM server responsibilities in a
minimal amount of time.
Figure 8-8 shows a more sophisticated setup, where two TSM servers each
contain a hot instance of each other which can be used for DR in the event of
either one failing. In this scenario, in normal operations, the two TSM servers run
separate, independent workloads at separate sites. Each server's database and
storage pools are mirrored remotely, with a standby TSM instance also installed. If
a failure took out one site, the surviving site would then be able to run both TSM
servers on the same system.
Requirements
Same operating system version and PTF level.
Same TSM version and PTF level.
The TCP/IP address of the standby server must not be the same as that of the
primary server while the primary is active, but the standby must be able to
assume the primary server's TCP/IP address in a disaster; alternatively, change
the TCPSERVERADDRESS parameter in the client options file on all TSM
clients.
TCP/IP addressing
You have to make sure that either the recovery machine has the same TCP/IP
address as the original TSM server, or that all TSM client option files are
updated.
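The relevant client options file fragment might look like this sketch (the address is assumed); only TCPSERVERADDRESS needs to change if clients are repointed at the recovery machine:

COMMMETHOD TCPIP
TCPSERVERADDRESS 192.0.2.10
TCPPORT 1500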
Configuration files
Configuration files need to be made available on the standby servers – they
include devconfig, volhistory, and dsmserv.opt.
Summary
IBM Tivoli Storage Manager and Disaster Recovery Manager can provide a
method to restore the production TSM server for backup/restore and space
management services. What needs to be considered is which scenario is best
for you: cold, warm, or hot standby. This will be driven by the needs of your
business.
IBM Tivoli Storage Manager incorporates features that can be exploited to give a
range of protection for the storage management server. This extends from
hardware disk failure through to immediate failover to alternative hardware in
another location, without loss of stored data or interruption to the service.
8.5.1 Failover
The TSM server fully supports the process of failover to alternate hardware in the
event of failure of the TSM server hardware, where the servers are clustered
using the applications IBM HACMP or MSCS from Microsoft. To achieve this, the
TSM server database and storage pools are allocated on shared disk between
the primary and fail-over servers. The fail-over product monitors the TSM server
process and hardware. In the event of failure of the primary server the fail-over
process restarts the TSM server on the fail-over server.
Where a mirrored copy of the TSM server database is available, as provided by
the DBMIRRORWRITE SEQUENTIAL or PARALLEL and DBPAGESHADOW
options, the TSM server can be restarted immediately, without loss of service.
Any currently scheduled and executing client or server tasks and operations will
restart and continue.
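As a sketch of the corresponding server options in dsmserv.opt (verify the option spellings against the Administrator's Reference for your server level; the page shadow file name is the documented default):

MIRRORWRITE DB PARALLEL
MIRRORWRITE LOG SEQUENTIAL
DBPAGESHADOW YES
DBPAGESHADOWFILE dbpgshdw.bdt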
For the Microsoft environment, TSM is a fully cluster-aware application that fully
integrates with and takes advantage of MSCS’s clustering and administrative
capabilities. TSM uses MSCS fail-over to offer two configurations:
Active/passive configuration
In this configuration one instance of a TSM server is created that can run on
either node. The active node performs the normal TSM functions and the passive
node serves as an online (hot) backup. In this configuration, the server has one
database, recovery log, and one set of storage pool volumes.
Active/active configuration
This configuration enables the cluster to support two independent instances of a
TSM server. Although the instances typically run on separate nodes, one node
can run both instances. Each TSM server has a separate database, recovery log,
and set of storage pool volumes.
Tape device failover can be achieved between two servers in a cluster by dual
attaching the tape devices to both server platforms. A number of different
configurations are possible:
IBM 3590 tape drives have dual SCSI ports, enabling each drive to be
separately cabled to each server.
Drives with only a single SCSI port can be connected using dual ended SCSI
cables with the device connected between both primary and fail-over Tivoli
Storage Manager servers.
In a Fibre Channel environment the drives can be zoned to both servers.
Using this approach, the TSM server can be restarted immediately following a
disaster. Consequently, when access has been provided to the offsite storage
pool copy tapes, recovery of client systems can commence. It provides a
recovery point that is determined by the last time offsite tapes were recreated
and sent to the recovery site.
Further resilience is achieved by replicating the TSM disk and tape storage pools
in real time to the remote site as well as the database and logs. This provides
failover without any loss of data at any level in the storage hierarchy.
To enable this in real time, network access to the tape libraries at the remote site
is required to store the offsite copy of the data. There are two approaches to this:
using TSM's virtual tape vaulting over a TCP/IP network, or having a Fibre
Channel connection to the remote tape library over an extended SAN.
Database Management: manages and tracks TSM server database and storage
pool backup volumes, and lets you know exactly what has been done and what
may still need to be done.
You can check whether the license was registered correctly using the QUERY
LICENSE command.
If we look at this new storage pool using the QUERY STGPOOL command, we can
see that it has been created as a copy storage pool, as shown in Example 8-3.
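The creation command itself is not reproduced here; a sketch of defining and then querying a copy storage pool (the pool and device class names are assumed) might be:

define stgpool copypool class1 pooltype=copy maxscratch=100
query stgpool copypool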
DRM settings
Now we will configure DRM with some basic settings.
1. First we want to specify a directory where the DR recovery plans will be
stored, using the SET DRMPLANPREFIX command. You can specify the plan
prefix with a full directory path. Using the form shown in Example 8-4, our DR
plans as generated by DRM will be stored in the directory C:\DRM\PLAN\ and
each file will be prefixed by the string RADON. If we did not use this
command, a default path would be used, which is the directory where the
instance of the TSM server is running. (A consolidated sketch of these SET
commands follows step 12 below.)
2. Next we set the prefix for where the recovery instructions will be stored. The
DRM PREPARE command will look for these files in this directory. This is done
using the SET DRMINSTRPREFIX command. You can specify the instructions
prefix with a full directory path. Using the form shown in Example 8-5, the
recovery instruction files should be located in the directory
C:\DRM\INSTRUCTION\ with each file prefixed by the string RADON. The
prefix does not need to be specified — if we did not use this command, a
default path would be used, which is the directory where the instance of the
TSM server is running. The recovery instruction files are user generated and
can contain any instructions related to the DR process. You can create those
files using any plain text editor, and include the information which is relevant
to your installation. Instruction files will be automatically included in the DRM
plan. The standard names for the instruction plans (without prefix) are
– RECOVERY.INSTRUCTIONS.GENERAL - for general information such as
the system administrator and backup names and contact details, and
passwords.
– RECOVERY.INSTRUCTIONS.OFFSITE - for information about the offsite
vault location and courier, including name, phone number, e-mail, fax,
after-hours pager, and so on.
– RECOVERY.INSTRUCTIONS.INSTALL - for TSM server installation
instructions such as passwords, hardware/software requirements, fix
levels, and so on.
3. Next, we will specify a character which will be appended to the end of the
replacement volumes names in the recovery plan file. This is done using the
SET DRMPLANVPOSTFIX command. Use of this special character means you
can easily search and replace these names for the replacement primary
storage pool volumes to your desired names before the recovery plan
executes. In Example 8-6, we are using the default character of @.
4. Now, use the SET DRMCHECKLABEL command to specify whether TSM will read
the labels of tape media when they are checked out using the MOVE DRMEDIA
command (Example 8-7). The default value is YES.
6. Use the SET DRMCOPYSTGPOOL command to indicate one or more copy storage
pools to be managed by DRM. Those copy storage pools will be used to
recover the primary storage pools after a disaster. The MOVE DRMEDIA and QUERY
DRMEDIA commands will process volumes in the copy storage pools listed here
by default (unless explicitly overridden with the COPYSTGPOOL parameter).
7. Next, use the SET DRMCOURIERNAME command to define the name of your
courier company. If not set, this will use the default value of COURIER. Any
string can be inserted here (see Example 8-10).
10.Next, specify the location where media will be stored while it is waiting to be
sent to the offsite location, using the SET DRMNOTMOUNTABLENAME command.
This location name is used by the MOVE DRMEDIA command to set the location
of volumes that transition to the NOTMOUNTABLE state.
11.The SET DRMRPFEXPIREDAYS command sets the number of days after creation
that a recovery plan file which has been stored on a target server (using
server-to-server communication) will be retained. This command and
expiration processing only applies to recovery plan files that are created with
the DEVCLASS parameter on the PREPARE command (that is, virtual volumes
of type RPFILE and RPSNAPSHOT). The most recent files are never deleted.
Example 8-14 shows changing this value from the default of 60 days to 30 days.
12.You can identify the vault name with the SET DRMVAULTNAME command, as
shown in Example 8-15. You can specify any string or use the default value of
VAULT.
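The referenced examples are not reproduced here; as a consolidated sketch of steps 1 through 12 (the paths, pool name, courier, location, and vault names are assumed to match this chapter's environment):

set drmplanprefix c:\drm\plan\radon.
set drminstrprefix c:\drm\instruction\radon.
set drmplanvpostfix @
set drmchecklabel yes
set drmcopystgpool copypool
set drmcouriername "Courier"
set drmnotmountablename "Staging area"
set drmrpfexpiredays 30
set drmvaultname "Vault"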
We will now describe these steps in detail. Throughout this section, we refer to
the possible states which a piece of DR media could be in. The DR media
consists of the volumes used for storage pool and database backup. The states
and their normal life cycle are shown in Figure 8-10. Media changes state by use
of the MOVE DRMEDIA command.
Figure 8-10 DR media life cycle: storage pool (DRP) and database (DB) backup volumes move from the onsite NOTMOUNTABLE state, via COURIER, to the offsite VAULT, and return through the VAULTRETRIEVE and COURIERRETRIEVE states
When the copy tapes are initially made, they are in the MOUNTABLE state,
meaning they are available to the TSM server. After they are ejected from the
library and are ready to be picked up for transport by the courier, they transition
to the NOTMOUNTABLE state.
Now we want to move the DR media to the VAULT location. First, we should
check whether any of the required volumes are still mounted. We can do this
using the QUERY MOUNT command. If so, we can dismount them with the DISMOUNT
VOLUME command as shown in Example 8-21.
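A sketch of that check and dismount (the volume name is assumed):

query mount
dismount volume ABA926L1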
This has processed both volumes and marked them as state NOTMOUNTABLE
since they are no longer available in the tape library. Their location has been
changed to the value specified in SET DRMNOTMOUNTABLENAME (as described in step
10 on page 197). At this stage, the courier would arrive to collect the volumes.
Once the volumes have been signed over to the courier, we need to indicate this
state change to DRM, using again the MOVE DRMEDIA command, shown in
Example 8-23. This command will change the state of the DR media volumes
from NOTMOUNTABLE to COURIER, and their location to the value specified in
SET DRMCOURIERNAME (as described in Step 7 on page 197), indicating they are in
transit.
Finally, when the vault confirms that the DR media have safely arrived, use the
MOVE DRMEDIA command one more time, as shown in Example 8-24. This sets
the state of the DR media volumes to VAULT, and the location to the value
specified in SET DRMVAULTNAME (as described in Step 12 on page 198).
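The referenced examples are not reproduced here; a sketch of the full sequence of state changes, using a wildcard to process all eligible volumes:

move drmedia * wherestate=mountable tostate=notmountable
move drmedia * wherestate=notmountable tostate=courier
move drmedia * wherestate=courier tostate=vault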
Figure 8-11 shows the process we have used to back up the storage pools and
database, and send the media offsite.
Figure 8-11 Daily operations - primary pools backup and TSM database backup (the TSM server on RADON backs up the database to volume ABA927L1 and moves the DR media offsite to the VAULT)
(Figure: after a disaster, the DRM PREPARE command on the TSM server generates the recovery plan file C:\DRM\PLAN\yyyymmdd.hhmmss from the database and client copy data)
We generate the recovery plan using the PREPARE command, as in Example 8-26.
The plan is now stored in a time-stamped file in the local directory with prefix as
defined in SET DRMPLANPREFIX, as shown in Step 1 on page 195. The file created
here is called C:\DRM\PLAN\RADON.20020726.183058. A complete listing of
the recovery plan output is provided in Appendix C, “DRM plan output” on
page 353. We recommend, for safety, that you create multiple copies of the
recovery plan, stored in different locations. You can create a remote copy of the
DRP by specifying a DEVCLASS on the PREPARE command. This DEVCLASS
can only be of type SERVER, and is used to store the DRP on a target server
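As a sketch of both forms of the command (the device class name is assumed):

prepare
prepare devclass=remote_class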
Send this list to the vault for return to the primary location. When you are notified
that the volumes have been given to the courier, this state change can be
reflected in DRM by using:
move drmedia * wherestate=vaultretrieve tostate=courierretrieve
Returned media can then be inserted back into the library for reuse with the
CHECKIN LIBVOL command. Note that you can also use options on the MOVE
DRMEDIA command for DR media in the COURIERRETRIEVE state to
automatically generate a macro of CHECKIN LIBVOL commands. Refer to the TSM
Administrator’s Reference for more details.
The plan should be copied back to the same directory as it was stored on the
original TSM server when it was created.
(Figure: the disaster recovery plan in C:\DRM\PLAN\ is broken out by the planexpl.vbs script into instruction files, command files, and CMD input files)
You can use a text editor to manually divide the recovery plan into its
components, or use sample scripts shipped with TSM. For Windows, a sample
VBScript is in planexpl.vbs, shipped with DRM. For UNIX, a sample awk script is
in planexpl.awk.smp. You should keep a copy of these scripts offsite along with
the recovery plan. We recommend that you become familiar with executing the
scripts, as the plan is large and a manual breakout would be time-consuming
and error-prone.
(Figure: restoring the TSM database on the replacement server LEAD, using database backup volume ABA927L1 and copy storage pool volume ABA926L1)
Example 8-30 shows the command output from breaking out the DRP on the
replacement server. You can see that many smaller files are created.
Planfile: C:\DRM\plan\RADON.20020726.183058
set planprefix to C:\DRM\PLAN\RADON.
Creating file C:\DRM\PLAN\RADON.SERVER.REQUIREMENTS
Creating file C:\DRM\PLAN\RADON.RECOVERY.INSTRUCTIONS.GENERAL
Creating file C:\DRM\PLAN\RADON.RECOVERY.INSTRUCTIONS.OFFSITE
Creating file C:\DRM\PLAN\RADON.RECOVERY.INSTRUCTIONS.INSTALL
Creating file C:\DRM\PLAN\RADON.RECOVERY.INSTRUCTIONS.DATABASE
Creating file C:\DRM\PLAN\RADON.RECOVERY.INSTRUCTIONS.STGPOOL
Creating file C:\DRM\PLAN\RADON.RECOVERY.VOLUMES.REQUIRED
Creating file C:\DRM\PLAN\RADON.RECOVERY.DEVICES.REQUIRED
Creating file C:\DRM\PLAN\RADON.RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE.CMD
Creating file C:\DRM\PLAN\RADON.RECOVERY.SCRIPT.NORMAL.MODE.CMD
Creating file C:\DRM\PLAN\RADON.LOG.VOLUMES
Creating file C:\DRM\PLAN\RADON.DB.VOLUMES
Creating file C:\DRM\PLAN\RADON.LOGANDDB.VOLUMES.INSTALL.CMD
Creating file C:\DRM\PLAN\RADON.LICENSE.REGISTRATION.MAC
Creating file C:\DRM\PLAN\RADON.COPYSTGPOOL.VOLUMES.AVAILABLE.MAC
Creating file C:\DRM\PLAN\RADON.COPYSTGPOOL.VOLUMES.DESTROYED.MAC
Creating file C:\DRM\PLAN\RADON.PRIMARY.VOLUMES.DESTROYED.MAC
Creating file C:\DRM\PLAN\RADON.PRIMARY.VOLUMES.REPLACEMENT.CREATE.CMD
Creating file C:\DRM\PLAN\RADON.PRIMARY.VOLUMES.REPLACEMENT.MAC
Creating file C:\DRM\PLAN\RADON.STGPOOLS.RESTORE.MAC
Creating file C:\DRM\PLAN\RADON.VOLUME.HISTORY.FILE
Creating file C:\DRM\PLAN\RADON.DEVICE.CONFIGURATION.FILE
Creating file C:\DRM\PLAN\RADON.DSMSERV.OPT.FILE
Creating file C:\DRM\PLAN\RADON.LICENSE.INFORMATION
Creating file C:\DRM\PLAN\RADON.MACHINE.GENERAL.INFORMATION
Creating file C:\DRM\PLAN\RADON.MACHINE.RECOVERY.INSTRUCTIONS
Creating file C:\DRM\PLAN\RADON.MACHINE.CHARACTERISTICS
Creating file C:\DRM\PLAN\RADON.MACHINE.RECOVERY.MEDIA.REQUIRED
C:\TSMDB\TSMLOG01.DB
C:\TSMDB\TSMDB01.DB
C:\TSMDB\TSMDB02.DB
begin RECOVERY.VOLUMES.REQUIRED
end RECOVERY.VOLUMES.REQUIRED
We can see that volume ABA926L1 has a copy storage pool entry, whereas
volume ABA927L1 does not. This is because ABA927L1 was used for a
database backup. We can confirm this by looking at the volume history section of
the recovery plan (VOLUME.HISTORY.FILE stanza). We will need the volume
ABA927L1 initially for restoring the TSM database.
If some required storage pool backup volumes could not be retrieved from the
vault, remove the volume entries from the
COPYSTGPOOL.VOLUMES.AVAILABLE file.
rem Purpose: Create replacement volumes for primary storage pools that
rem use device class DISK.
rem Recovery administrator: Edit this section for your replacement
rem volume names. New name must be unique, i.e. different from any
rem original or other new name.
IBM 3583 Scalable LTO tape library
The IBM 3583 is an automated library and we will have to manually place the
database backup volumes into the library (since there is no TSM server to check
them in) and update the configuration information to identify the element within
the library where the volumes are placed. This allows the server to locate the
required database backup volumes. In Example 8-34 we added a line to the
device configuration file (DEVICE.CONFIGURATION.FILE stanza in the DR
Plan) with the location of tape volume ABA927L1, and the actual element
address 0016. This element address corresponds to I/O station slot 1 in the
library. For information on the element addresses for your particular devices,
consult your tape library vendor documentation and the Tivoli Web site on device
support:
http://www.tivoli.com/support/storage_mgr/requirements.html
We changed the DEFINE statements for the library and paths to reflect the actual
new device special files for the library and tape drives.
end DEVICE.CONFIGURATION.FILE
ANR8200I TCP/IP driver ready for connection with clients on port 1500.
ANR0200I Recovery log assigned capacity is 128 megabytes.
ANR0201I Database assigned capacity is 2048 megabytes.
ANR4600I Processing volume history file C:\PROGRA~1\TIVOLI\TSM\SERVER1\VOLHIST.
OUT.
ANR4620I Database backup series 7 operation 0 device class CLASS1.
ANR4622I Volume 1: ABA927L1.
ANR4634I Starting point-in-time database restore to date 07/26/2002 17:23:38.
ANR8337I LTO volume ABA927L1 mounted in drive MT01 (mt0.2.0.4).
ANR1363I Input volume ABA927L1 opened (sequence number 1).
ANR4646I Database capacity required for restore is 1024 megabytes.
ANR4649I Reducing database assigned capacity to 1024 megabytes.
ANR4638I Restore of backup series 7 operation 0 in progress.
ANR4639I Restored 15872 of 31905 database pages.
ANR4639I Restored 31808 of 31905 database pages.
ANR4640I Restored 31905 pages from backup series 7 operation 0.
ANR0306I Recovery log volume mount in progress.
ANR4641I Sequential media log redo pass in progress.
ANR4642I Sequential media log undo pass in progress.
ANR1364I Input volume ABA927L1 closed.
ANR4644I A full backup will be required for the next database backup operation.
ANR4635I Point-in-time database restore complete, restore date 07/26/2002
17:23:38.
ANR8468I LTO volume ABA927L1 dismounted from drive MT01 (mt0.2.0.4) in library
LB6.0.0.3.
Wait for the server to start. Ensure that the Administrative command
line client option file is set up to communicate with this server, then
press enter to continue recovery script execution.
Press any key to continue . . .
At this stage, a second command window is opened which starts the TSM server
as shown in Example 8-36.
The scripts have finished successfully, the TSM database was restored, and the
TSM server starts. In our case, the library and drive paths had been altered in the
DEVICE.CONFIGURATION.FILE, which causes the Unable to open device
error. To correct this, we update the library and drive paths directly in that
command window, as shown in Example 8-37.
Once the device configuration is set correctly you can mount copy storage pool
volumes upon request, check in the volumes in advance, or manually place the
volumes in the library and ensure consistency by issuing the AUDIT LIBRARY
command.
Note: This action is optional because TSM can access the copy storage pool
volumes directly to restore client data. Using this feature, you can minimize
client recovery time, because server primary storage pools do not have to be
restored first.
Enter the script file name at the command prompt and follow with the
administrator name and password as parameters. This script creates
replacement primary storage pool volumes, defines them to TSM and restores
them from the copy storage pool volumes.
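As a sketch of invoking it (the script name comes from the plan breakout shown earlier; the administrator ID and password are assumed):

C:\DRM\PLAN> RADON.RECOVERY.SCRIPT.NORMAL.MODE.CMD admin adminpassword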
5698-ISE (C) Copyright IBM Corporation 1990, 2002. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corporation.
As an alternative, you can use the recovery script as a guide and manually run
each step.
The ongoing results of the TSM server restore process using the
DRM-generated scripts are logged in the following files.
For some of these platforms, there may be more than one option for bare metal
recovery. Therefore, this short chapter will provide a framework of concepts to
help guide when one or more of these solutions may be used. This chapter will
also discuss some common alternatives to a full-blown bare metal recovery
(BMR) solution.
Keep in mind that there may be some hosts that are so critical to your
organization that recovering them from backup data is too slow. In those
instances, solutions such as replicated data or applications will provide faster
recovery, but incur more cost.
within the DR chapters. In all cases, the value of a host to the organization will
guide the type of DR support.
What will differentiate the types of BMR solutions will be how each of these steps
is accomplished.
While this has the advantage of little to no up-front cost, it requires more manual
effort and an estimated half hour to one and a half hours per host to install the
operating system before it is possible to begin restoring the actual data
(depending on the operating system). In the case of more complex UNIX hosts,
this time may be significantly longer. This type of solution may be suitable for a
small number of critical hosts or larger numbers of non-critical hosts.
After a fast restore of the operating system from the additional media, the normal
backup engine is accessed to restore the entire machine to its last state. The
third party software in this case can be imaging software or other specialized
clients. Enterprise-class UNIX operating systems often have this software
included as a utility in the base configuration (for example, the mksysb command
on AIX). By using a generic operating system image simply as a bootstrap into
the BMR process, this image can serve many hosts, thereby avoiding the
administrative overhead of keeping current and complete images for all hosts.
A minimal operating system installed on (or recovered to) its own partition, or a
secondary operating system that was already installed, can be used in
combination with hot image backups from the enterprise backup/restore product.
Hot image backups can be performed on a running server (including the
operating system partition) but these hot backups cannot be restored onto the
same partition currently running the operating system. By using another partition
to run a minimal operating system and the backup/restore software, that image
can be restored to another inactive partition. Restoring those files that could not
be backed up during the hot image backup will complete the BMR. This data
would include system configuration databases, registries, directories, and so on.
10.1.2 Backup
The most frequently used utilities for system backup and restore on the Solaris
platform are ufsdump and ufsrestore. Solaris Version 7 also provides other
utilities for backup and restore of files, and file systems. Commands for backing
up and restoring files and file systems are summarized in Table 10-1.
For the purpose of this test we used the default include/exclude list for a TSM
UNIX client. We also set the copy group to use shared static option to avoid
fuzzy backups. For more information about using and configuring the TSM client
on Sun Solaris, see Tivoli Storage Manager for UNIX Backup-Archive Clients
Installation and User’s Guide, GC32-0789.
Hardware configuration
Use the /usr/platform/`uname -i`/sbin/prtdiag -v command to
collect information about the machine. Other useful commands are:
– prtconf -v
– psrinfo -v
– isainfo -v
– dmesg
– iostat -En
– netstat -in
All disk and partitioning information should be recorded and stored before the
disaster, either in hardcopy or within DRM. This information can be gathered
with the df or prtvtoc commands.
3. Check the file system for consistency with the fsck command, shown in
Example 10-2. Running the fsck command checks for consistency of file
systems in case something like a power failure has left the files in an
inconsistent state.
4. Identify the device name of an attached tape drive. We are using /dev/rmt/0.
5. Insert a tape that is not write protected into the tape drive.
6. Back up file systems using the ufsdump command. We need a full backup of
root file system. This is specified by option 0, as in Example 10-3.
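Example 10-3 is not reproduced here; the full (level 0) dump of the root file system to the tape drive identified earlier would look like this sketch:

# ufsdump 0uf /dev/rmt/0 /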
Before you start to restore files or file systems, you need to know or have:
Which tapes you need
The raw device name on which you want to restore the file system
The type of tape drive you will use
The device name for the tape drive
Bootable Solaris environment CD
When a disaster strikes, we first need to restore the / and /usr file systems.
Then we can restore user data using the TSM client. Here is the procedure.
1. Boot from the Solaris 7 CD to single user mode as shown in Example 10-4.
2. Format a partition as a ufs file system where the root directory will be restored.
3. Check the partition where the root partition will be restored as shown in
Example 10-6.
4. Now that we have a partition ready to restore the operating system, mount it
and change directory to it, as in Example 10-7.
5. Now we are ready to restore the partition from the tape using the ufsrestore
command, as in Example 10-8.
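The referenced examples are not reproduced here; a sketch of steps 2 through 5, assuming the root partition is /dev/dsk/c0t0d0s0 and the temporary mount point is /a:

# newfs /dev/rdsk/c0t0d0s0
# mount /dev/dsk/c0t0d0s0 /a
# cd /a
# ufsrestore rf /dev/rmt/0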
9. We can check the disk partition one more time using the fsck command; see
Example 10-6 on page 236.
10.Repeat Steps 2 through 7, to restore the /usr partition. You should use the
same partition device name as previously so this will match the backup
information on the root partition. Use the tape made with ufsdump of the /usr
file system to restore it.
You have now restored your root file system. This should include the TSM client
executable and configuration files, presuming they were installed in the root file
system as well. If so, you can then use the TSM client to restore user data (that
is, outside the root file systems). Otherwise, first re-install the TSM client, then
restore the data.
More information on these topics can be obtained from the AIX system
documentation, from the redbook NIM: From A to Z in AIX 4.3, SG24-5524 and
from the SysBack Web site:
http://sysback.services.ibm.com/
System data makes up the operating system and its extensions. The system data
is always to be kept in the system file systems, namely / (root), /bin, /dev, /etc,
/lib, /usr, /tmp, and /var.
It is good practice to keep user (that is non-system) data out of the root volume
group file systems. User data might be files belonging to individual users, or it
might be application-specific data. User data tends to change much more often
than operating system data.
Hint: In AIX 5.1.1, if you modify the bos.inst.data file before creating the
mksysb output, using RECOVER_DEVICE=no, then device specific
customization will not be performed upon restoration. This option would be
useful in a coldsite disaster recovery scenario, where recovery equipment
configurations are often not identical to the primary site.
As the mksysb tape is read during a system restore, the kernel and device drivers
are loaded, the system is prepared to receive the restored data, and the rootvg
data is restored.
The exclude files option can be chosen from the SMIT interface, which by default
refers to the /etc/exclude.rootvg file. From the command line, use the -e option for
the mksysb command.
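As a sketch, assuming the tape device /dev/rmt0 and an exclude pattern for /tmp:

# echo "^./tmp/" >> /etc/exclude.rootvg
# mksysb -i -e /dev/rmt0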
The AIX savevg command finds and backs up all files belonging to a specified
volume group. It creates a volume group definition file by calling the mkvgdata
command. Savevg alone does not provide enough granularity to just save volume
group configuration data. The actual files in the volume group can be backed up
with TSM. In addition to this, the specific volume group configuration needs to be
saved so that it can be re-created at a later date to receive the restored data in
the event of a bare metal restore. We present here a script, vg_recovery, which
saves volume group configuration data in an easily restorable format.
The vg_recovery script saves the volume group configuration data, without
actually backing up any data within the volume group. Example 11-2 shows the
content of the vg_recovery script. Details of how to download a softcopy of this
script are given in Appendix D, “Additional material” on page 387.
# Loop over all online volume groups (loop header reconstructed; see the
# full script in Appendix D)
for i in $(lsvg -o)
do
  if [ $i != "rootvg" ]
  then
    # Build filesystem exclude list so no file data is backed up
    lsvg -l $i | egrep -v "LV|N\/A" | awk '{print $7}' > /etc/exclude.$i
    if [[ $? != 0 ]]; then
      exit 1
    fi
    # Save only the volume group structure (backup file path assumed)
    savevg -i -e -f /usr/local/vgdata/$i $i
  fi
done
Tip: If the vg_recovery script does not execute, run #chmod +x vg_recovery in
the directory where you have saved the script.
[Entry Fields]
* Restore DEVICE or FILE [/usr/local/vgdata/vg01]
+/
SHRINK the filesystems? no +
PHYSICAL VOLUME names [] +
(Leave blank to use the PHYSICAL VOLUMES listed
in the vgname.data file in the backup image)
Use existing MAP files? yes +
Physical partition SIZE in megabytes []
+#
(Leave blank to have the SIZE determined
based on disk size)
Number of BLOCKS to read in a single input []
#
(Leave blank to use a system default)
Select the device and appropriate options and run the command. Example 11-4
shows a sample execution.
[Entry Fields]
WARNING: Execution of the mksysb command will
result in the loss of all material
previously stored on the selected
output medium. This command backs
up only rootvg volume group.
The procedure for creating mksysb tapes is simple and can be performed using
locally attached tape devices. Writing mksysb images to tape has certain
The mkcd command allows users to write bootable mksysb images to writable CD
media. CD-R drives and media have become very inexpensive. Each CD-R disk
can contain up to 700 MB of data and DVD media holds up to 4.7 GB per
surface, which is usually sufficient for operating system data in a mksysb format.
The highest quality media is recommended.
Many CD-R devices are available on the market. Currently, four have been tested
with AIX for use in this capacity.
Yamaha CRW4416SX - CD-RW
RICOH MP 6201S- CD-R
Panasonic 7502-B - CD-R
Young Minds CD Studio - CD-R
A mksysb CD will only allow a new and complete overwrite install, because it
recreates the rootvg and restores the mksysb image. By default, the mkcd
command creates three file systems in the root volume group for the mksysb
image, the CD/DVD file system, and the CD images. This data is then written to
media using the mkcd command. Specific procedures for using the mkcd
command with DVD or CDR devices is provided in the redbook Managing AIX
Server Farms, SG24-6606 and also in the AIX man pages.
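As a sketch of the command, assuming an existing mksysb image and a CD-R device at /dev/cd1:

# mkcd -d /dev/cd1 -m /mksysb_images/sysb.img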
Note: The backup and restore method described below should not include
/etc/objrepos, /usr/lpp, /unix, or /.../core files.
As shown in Example 11-5, specific system files can be excluded in the client
options file through the use of exclude statements.
DOMAIN "/"
DOMAIN "/usr"
DOMAIN "/var"
DOMAIN "/home"
DOMAIN "/hacmp"
DOMAIN ALL-LOCAL
EXCLUDE /usr/lpp/.../*
EXCLUDE /.../objrepos/.../*
EXCLUDE /unix
EXCLUDE /.../core
The use of the TSM backup-archive client can complement standard mksysb
procedures to lower the frequency of system backups needed for relatively static
operating system configurations. The list of files to be included should be
carefully compiled and tested. With mksysb and TSM, the system restore
procedure would be:
Boot and install system from latest mksysb media
Make necessary changes to device configurations (if needed)
Restore system data and user data with the TSM client (using -ifnewer option)
Tip: You can boot from the AIX installation CD if your mksysb media fails to
boot. The initial Welcome screen includes an option to enter a maintenance
mode in which you can continue the installation from your backup mksysb
media.
Once the mksysb install procedure is initiated, a few prompts for terminal settings
and language definitions appear. After these, the bare metal restore of AIX
continues without prompts until the system is restored.
The NIM application defines one NIM master and many NIM clients. The NIM
master is responsible for storing NIM client information (mksysb backups and
other resources), network information, and specific client machine definitions.
NIM allows one machine to act as the master for a given set of clients, but many
instances of the NIM master can coexist within a large AIX environment.
NIM controls the backup and restore of AIX system data for each NIM client. A
TSM client is configured for the NIM master server to back up additional copies
and versions of the NIM client resource data. Specific elements within NIM are
also backed up to provide DR capabilities to the NIM server itself.
Figure 11-4 Bare metal restore of AIX clients using NIM and TSM
In a large DR scenario, NIM can be used to restore multiple AIX system images
at once. Network bandwidth determines the speed by which multiple machines
can be restored and must be accounted for in the Disaster Recovery Plan of
events. With proper systems design and planning, NIM bare metal restore can
outperform other system recovery methods.
TFTP, bootp, and NFS must be enabled on the NIM master server to run NIM.
For security and performance reasons, we recommend that you build a dedicated
server for NIM in a medium to large AIX environment.
Network hostname resolution is vital for the NIM master to be able to manage the
NIM client machines. If these machines are not on the local DNS server, an entry
for each NIM client hostname must be made in the /etc/hosts file. Name
resolution is also vital when mounting file systems from the resource servers.
Once the NIM master filesets are installed, configure the NIM master by using:
#smitty nim_config_env
The output from the basic NIM master setup screen is shown in Example 11-6.
This command will take some time to build the NIM master file systems and
resources. The LPPSOURCE and SPOT resources are built during the NIM
master creation. As long as the required space is available in the root volume
group, the installation will complete. Next, configure the NIM clients on each host
machine.
Note: The LPPSOURCE and SPOT resources are AIX version specific, so for
heterogeneous AIX environments, multiple LPPSOURCE and SPOT
resources must be configured to support system restores. When
implementing a NIM master to support a heterogeneous AIX environment,
choose clear names and locations for the additional NIM LPPSOURCE and
SPOT files.
[Entry Fields]
* Machine Name [sicily]
* Primary Network Install Interface [en0] +
* Host Name of Network Install Master [crete]
Be certain to select the appropriate NIM master machine name. With default NIM
master settings, the basic client definition data will be sent to the NIM master
server database automatically. NIM uses NFS and TFTP services to move NIM
and systems data between the server and client. Make certain NFS mounts are
allowed on the NIM client machine. On the NIM master, check that the new NIM
client is defined as a standalone machine by running the lsnim command shown
in Example 11-8.
We suggest that NIM administrators use the following naming system to avoid
NIM database conflicts:
To conserve disk space on the NIM master server, we suggest that older versions
of mksysb files be deleted. TSM can be used to back up and archive older
mksysb versions from the NIM master server as needed.
Select mksysb from the first list of options and then define the resource to the
NIM master as shown in Example 11-9.
[Entry Fields]
* Resource Name [Basic_Exclude ]
* Resource Type exclude_files
* Server of Resource [master]
+
* Location of Resource
[/export/exclude/Exclude_files] /
Comments [Standard Exclude List]
The SPOT resource is associated along with the mksysb and lppsource
resources of the NIM client. The SPOT resource will be installed from images in
the image source (AIX media, LPPSOURCE, etc.) and must match the operating
system version of the AIX client. The SPOT resource resides on the NIM Master
server and is stored in the /export/spot/spot1 directory. The SPOT is used to boot
the NIM client during network boot procedures.
When a SPOT is created, network boot images are constructed in the /tftpboot
directory of the NIM master server. During the network boot process for bare
metal restore, the NIM client uses TFTP to obtain the boot image from the server.
After the boot image is loaded into memory at the client, the SPOT is mounted in
the client's RAM file system to provide all additional software support required to
complete the installation operation. One SPOT resource can be used repeatedly
for a client and does not need to be recreated with every mksysb backup.
The bootptab listing references the client.info file, which specifies exactly which
NIM resources (SPOT, LPPSOURCE, MKSYSB) the NIM master will use to
restore the NIM client. The client.info file listing for our NIM client is shown in
Example 11-13.
The client.info file, which is named after the client hostname, describes the NIM
resources to be mounted to the NIM client during a bare metal restore.
To make certain each NIM client is registered with client.info and bootptab
listings on the NIM master, we follow this procedure for each NIM client or group
of NIM clients: smitty nim -> Perform Software Installation and Maintenance
Tasks -> Install and Update Software -> Install the Base Operating System
on Standalone Clients.
Select the target to be the standalone client you are configuring (sicily in our
case), choose mksysb for Installation TYPE, and choose the mksysb and SPOT
resources most recently created for the NIM client (under MKSYSB and SPOT).
Also select the LPPSOURCE resource that matches the NIM client operating
system version. Example 11-14 shows the general settings for performing this
procedure.
Example 11-14 Sample screen for installing the BOS for Standalone clients
* Installation Target sicily
* Installation TYPE mksysb
* SPOT sicily_spot
* LPP_SOURCE lpp_source1
MKSYSB sicily_mksysb2
installp Flags
COMMIT software updates? [no] +
SAVE replaced files? [no] +
AUTOMATICALLY install requisite software? [no] +
EXTEND filesystems if space needed? [yes] +
OVERWRITE same or newer versions? [no] +
VERIFY install and check file sizes? [no] +
ACCEPT new license agreements? [no] +
Preview new LICENSE agreements? [no] +
Since this procedure simply updates client settings on the NIM master server, it
is important to make certain the installp flags, initiate reboot, and set bootlist
settings are set exactly as shown above. This procedure ensures that the NIM
master will use the appropriate resources during a bare metal restore of the
client. Each time a new mksysb resource is created for a NIM client, this
procedure must be run on the NIM master.
The lsnim command provides a useful way to make sure all appropriate client
definitions are in place and there are no errors in the NIM client setup. We see
here that the client is ready to install its base operating system from the mksysb
image called sicily_mksysb2.
All versions of the IPL-ROM can search local devices for an AIX boot image.
Only BOOTP-enabled IPL-ROM can use a network interface to search for a
remote boot image. Machines manufactured before 1993 do not have
BOOTP-enabled IPL-ROM. To determine if a NIM client machine requires
IPL-ROM emulation, run the command:
#bootinfo -q <network adapter name>
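For example (ent0 is an assumed adapter name; a return value of 1 indicates a BOOTP-enabled adapter that can be used for network boot):

#bootinfo -q ent0
1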
Insert a formatted diskette into the floppy drive on the NIM master and use
smitty IPL-ROM -> Create IPL-ROM emulation media. Select the target device
to create the IPL-ROM emulation media (diskette or tape), the pathname of the
emulation image, and the default value for the boot mode.
The hardware platform and kernel type of the client determine the procedure
required to boot the machine over the network. The hardware may be rspc, rs6k,
or chrp based. To determine the platform of a machine, use the bootinfo -p
command. To determine the kernel type of a running machine, use the bootinfo
-z command. The following section describes the boot procedures for an rspc
type machine. Other system type boot sequences are described in the redbook
NIM: From A to Z in AIX 4.3, SG24-5524.
Note: For ASCII terminals, press the F4 key as words representing the AIX
icons appear. The relevant function key will depend on the type and model of
rspc machine, so refer to your User Guide.
For later models of rspc, the functionality of the SMS diskette is incorporated
into the firmware, which is accessed by pressing the F1 key.
Note: If the recovery system relies on IPL-ROM support, select the floppy
drive (fd0) as the first device and the appropriate network adapter as the
second device.
11.Exit from the SMS menu and commence the NIM network boot and installation.
The NIM database should be regularly backed up using the smit nim_backup_db
fastpath. By default, the NIM database is backed up to
/etc/objrepos/nimdb.backup. In a recovery scenario, the NIM server is restored
using the smit nim_restore_db fastpath, assuming the nimdb.backup file is
already available in its default directory. A NIM database backup should only be
restored at the most recent NIM version used in the environment.
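For reference, both fastpaths can be invoked directly from the command line:

#smit nim_backup_db      (backs up the NIM database to /etc/objrepos/nimdb.backup)
#smit nim_restore_db     (restores the NIM database from nimdb.backup)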
For anyone new to NIM, we highly recommend the following IBM Redbooks and
manuals for detailed information about NIM setup, administration,
troubleshooting, and customization:
NIM: From A to Z in AIX 4.3, SG24-5524
AIX 5L Version 5.1 Network Installation Management Guide and Reference,
SC23-4385
AIX Version 4.3 Network Installation Management Guide and Reference,
SC23-4113
Install the TSM backup client onto the NIM master server and configure TSM to
provide regular backups of the following directories and files:
/export/
/tftpboot/
/etc/objrepos/nimdb.backup
/etc/objrepos/nim_attr
/etc/objrepos/nim_attr.vc
/etc/objrepos/nim_object
/etc/objrepos/nim_object.vc
/etc/niminfo
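A minimal sketch of the corresponding TSM client option entries follows. The domain statement assumes /export is a separate file system, and the management class name NIMMGMT is an assumption for illustration:

* dsm.sys stanza excerpt for the NIM master
DOMain     /export /tftpboot /
INCLExcl   /usr/tivoli/tsm/client/ba/bin/inclexcl.nim

* inclexcl.nim: bind the NIM objects to an assumed management class
include /etc/objrepos/nim* NIMMGMT
include /etc/niminfo NIMMGMT
include /export/.../* NIMMGMT
include /tftpboot/.../* NIMMGMT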
The NIM objects in the /etc/objrepos directory are in any case included in the
mksysb backups of the NIM master server. Depending on the size of the NIM
environment and the frequency of NIM mksysb resource creation, the NIM
master server environment can grow very large. TSM can help to limit the space
requirements on the NIM master and provide full recovery capabilities for the NIM
master environment.
Note: The system data for the NIM master server must be routinely backed up
to a bootable mksysb tape or CD/DVD for its own disaster recovery. We
suggest that users unmount or exclude the /export directory prior to creating
the NIM master mksysb if this is being backed up by TSM.
SysBack can be used to reinstall the system to its original device configuration,
including the volume group and logical volume placement on disk and attached
devices. SysBack can also be used to clone system images for restoration on
different systems in a disaster recovery scenario. SysBack also provides the
ability to automate versioning and incremental backups of system data, and
operations can be performed over non-local tape resources.
SysBack can boot through classic network boot or NIM resource boot
procedures. If SysBack is using the classic boot procedure, the boot server AIX
level must match that of the clients being restored. If using the NIM resource
boot procedure, the NIM environment provides the boot support.
For more information about SysBack, please refer to the following Web site.
http://sysback.services.ibm.com/
However, to restore your system, you need to have previously gathered and
saved certain machine-specific characteristics, such as network and disk
partition information. Therefore, we discuss in detail methods for collecting this
information using operating system utilities and storing it within DRM. We then
provide detailed, step-by-step instructions for recovery of a Windows 2000 client
in conjunction with TSM.
Scripts or batch files can be used to automate the collection of client information
for users not skilled in these kinds of system level commands. Client system
information should then be stored offsite for potential use during a Disaster
Recovery procedure. Client system information can be imported into DRM (via
scripts discussed in 12.1.3, “Storing system information for DRM access” on
page 271) or DRM administrators can be given access to system information
collected into a text file and backed up by the TSM backup-archive client.
msinfo32 has both a GUI and command line interface. This section will focus on
the command line interface since it can be scheduled for periodic execution using
mechanisms like the Tivoli Storage Manager Backup/Archive Client scheduler,
scripts, or batch files. Generally, msinfo32 will provide most of the required
information listed above and is installed by default with Windows 2000. It can be
run by entering this at the command line:
"C:\Program Files\Common Files\Microsoft Shared\MSInfo\msinfo32.exe"
You can use msinfo32 to display configuration information interactively via the
GUI interface, or generate a text file report via GUI or command line batch
invocation. The text file is organized in categories and subcategories stanzas
which are delimited with [category] and [subcategory] headings. There are many
categories and subcategories of information in the report including Hardware
Resources, IRQs, Memory, Drivers, and Environment Variables.
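As a sketch, the report can also be generated non-interactively with the /report switch (the output path here is an assumption):

"C:\Program Files\Common Files\Microsoft Shared\MSInfo\msinfo32.exe" /report c:\drm\msinfo32.txt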
A portion of the output for our BMR client machine follows in Example 12-2 —
notice the first [System Information] stanza.
[System Summary]
Item                Value
[Hardware Resources]
You should probably save the whole report, but if there are sections you are sure
would not be useful, you may want to delete them. If you type msinfo32 /?, you
can see the various invocation options. If the /categories option does not provide
the granularity you desire, a script can be used to extract selected information.
A sample VBScript script called msinfoextract.vbs that can pull subcategories
out of the report is provided in “Reducing msinfo32 output using a VBScript” on
page 351.
Note: Running msinfo32 and generating a report may take some time. In
our case it took about half a minute to generate the report.
On our test client system, we created a batch file to automatically save system
information to a text file and then start the TSM backup/archive client. We
created an icon on our desktop with a link to this batch file which can be used as
a replacement for our TSM backup/archive launch icon. The sample batch file is
shown in Example 12-3.
Example 12-3 Batch file for saving machine information and starting TSM client
@echo off
echo.
echo SAVING MACHINE INFORMATION FOR DISASTER RECOVERY
rem Remaining lines are a sketch; the report and client paths are typical defaults
"C:\Program Files\Common Files\Microsoft Shared\MSInfo\msinfo32.exe" /report "C:\Program Files\Common Files\Microsoft Shared\MSInfo\msinfo32.txt"
start "" "C:\Program Files\Tivoli\TSM\baclient\dsm.exe"
An example of using diskmap and the output from our main system volume is
given in Example 12-4.
Signature = 0x5a3c8bb3

     StartingOffset  PartitionLength  StartingSector  PartitionNumber
 *            32256      18186061824              63                1

MBR:
     Starting             Ending               System  Relative     Total
     Cylinder Head Sector Cylinder Head Sector   ID    Sector       Sectors
 *          0    1      1     1023  254     63  0x07        63      35519652
            0    0      0        0    0      0  0x00         0             0
            0    0      0        0    0      0  0x00         0             0
            0    0      0        0    0      0  0x00         0             0
Once there is a backup copy of the msinfo32 report for this machine on the TSM
server, you probably want to allow other users, such as the members of your
Disaster Recovery team, to access it. This assumes that they have TSM
Backup/Archive Client Node IDs registered for them. In Example 12-5, the TSM
Backup/Archive Client command line (the GUI can also be used) is used to
permit a TSM Client Node ID called DRTEAM to access the msinfo32.txt file
backed up from the directory c:\program files\common files\microsoft
shared\msinfo.
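The command line form likely resembles the following sketch, using the file specification and DRTEAM node ID from our example:

tsm> set access backup "c:\program files\common files\microsoft shared\msinfo\msinfo32.txt" drteam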
Assuming you have authorized it, a TSM Backup/Archive Client user on another
node could restore your msinfo32 report to a temporary directory on their
machine, so that it can be referred to at an alternative location while the
destroyed machine is rebuilt.
Alternatively, you can insert machine information using the command line as
shown in Example 12-6.
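As a sketch, the administrative INSERT MACHINE command takes a machine name, a sequence number, and a line of text (the characteristics text here is an assumption):

tsm: SERVER1> insert machine gallium 1 characteristics="OS Name Microsoft Windows 2000 Server"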
A script called machchar.vbs takes a text file and creates a TSM macro file of
INSERT MACHINE commands. This macro can then be run by the TSM
administrator to load a DRM MACHINE table with the information. Example 12-7
uses the machchar.vbs script and the machine information report
(msinfo32bat.txt) to create a macro (msinfo32bat.mac) that inserts multiple lines
of client information automatically. A VBScript is run from the Windows command
line as shown.
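The invocation likely took a form similar to this sketch (the argument order expected by machchar.vbs is an assumption):

cscript machchar.vbs msinfo32bat.txt msinfo32bat.mac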
Example 12-8 Running the TSM macro to insert machine data into DRM
macro “c:\Program Files\Common Files\Microsoft Shared\MSInfo\msinfo32bat.mac”
Figure 12-2 shows the machine client information now made available in DRM for
the machine called GALLIUM. We only show the machine summary stanza from
our msinfo32 output information. However, typically you would also include other
important stanzas from this output.
As discussed in “Break out the DR Plan” on page 208, the Recovery Plan File is
arranged in stanzas. After running the PREPARE commands, we would see the
following stanzas in the Recovery Plan File related to our system, GALLIUM, as
shown in Example 12-11. Provided the plan file has been appropriately
protected, this information will be available after a disaster so that it can be used
to recover the client system.
begin MACHINE.CHARACTERISTICS
[System Summary]
Item Value
OS Name Microsoft Windows 2000 Server
Version 5.0.2195 Service Pack 2 Build 2195
OS Manufacturer Microsoft Corporation
System Name GALLIUM
System Manufacturer IBM
System Model eserver xSeries 330 -[867411X]-
System Type X86 - based PC
Processor x86 Family 6 Model 11 Stepping 1 Genuine Intel ~1128 Mhz
Processor x86 Family 6 Model 11 Stepping 1 Genuine Intel ~1128 Mhz
BIOS Version IBM BIOS Ver 0.0
Windows Directory C:\WINNT
System Directory C:\WINNT\System32
Boot Device \Device\Harddisk0\Partition 1
Locale United States
User Name GALLIUM\Administrator
[Hardware Resources]
end MACHINE.CHARACTERISTICS
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
Rather than using Microsoft's logical placeholder (the System State), the TSM
client places individual components (such as Active Directory and the
registry) in the TSM System Object. Other Windows 2000 features that are not
part of the Microsoft System State, such as the Removable Storage
Management database, are also included as components of the TSM System
Object. TSM uses documented Microsoft Application Programming Interfaces
(APIs) to back up system objects. Other objects, such as the registry, do not have
interfaces that TSM can directly access. In these instances, TSM internally uses
Microsoft utilities to copy the system objects to a staging directory and then the
objects are backed up from that directory. A restore is performed in the reverse
order. System objects that are backed up and restored together include the
following.
Active Directory (domain controller only)
Certificate server database
To back up all partitions, including the system partition, TSM requires access to
the regular files, directories, or local drives in order to back them up. System
objects for Windows 2000 show up in the backup and restore screens of the TSM
GUI according to what services are available on the system that TSM is
attempting to back up.
Note: From the command line interface for the Backup/Archive client, the
BACKUP SYSTEMOBJECT command can be used to back up all valid system
objects on a Windows 2000 system.
Important: The TSM backup/archive client does not allow an image backup of
system objects or an image restore of a boot / system partition (C:) to its
original location. In order to perform BMR with image backups, it is necessary
to restore the image from another partition. After the image is recovered, the
registry would have to be manually copied from SystemDrive\ADSM.SYS to
SystemRoot\System32\Config and the restored partition booted. Finally, any
other post-BMR steps, such as restoring and activating the Active Directory,
would have to be performed.
During restore processing of the Registry, a copy of the current Registry is first
placed into the directory SystemDrive\ADSM.SYS\w2kreg.sav. This copy may
be used to restore the Registry in the event of a restore failure.
Note: You need to be logged in as a user with Backup Data, Restore Data,
and Audit Security rights to back up any local system data (such as the
registry and system files). For any network-centric system objects, such as the Active
Directory, Domain Administrator rights are required.
The method for backing up our system object files, boot / system partition (C:),
and data partitions via the TSM Backup/Archive GUI is shown in Figure 12-3.
Note that system objects are automatically included in the ALL-LOCAL domain
for backup.
To confirm that all of the System Object components were backed up as a single
entity, run the command QUERY SYSTEMOBJECT. This will display the backup
timestamps for each component and it should be obvious if a rogue backup of an
isolated System Object component exists. Although this is an unlikely event, it is
still worth taking the time to check, given the problems that it could cause.
Example 12-13 shows the results of the QUERY SYSTEMOBJECT command.
tsm> query systemobject
Our restore procedure assumes that a TSM server is installed with appropriate
code levels and patches, that the client node is registered with the server, and
that backups of all partitions and system objects have been taken. For the
restore, access to a Windows 2000 Installation CD and TSM Backup/Archive
Clients Installation CD is also required. The new hardware should be identical to
the original hardware. In some cases, it may be possible to restore to hardware
which is slightly different, for example, different disk capacities. We recommend
thoroughly testing the restore process if different hardware is to be used. The
system’s hardware components should already be correctly installed and
configured. This includes, but is not limited to:
System has power to all components
Keyboard, mouse and monitor are connected
Network controllers are installed and connected to the network
Cabling of disk controllers and array controllers is complete
Check that the hardware firmware is at the correct level (ideally this firmware
should be at the same level as when the backup was taken)
Restore procedure
1. Boot from Windows 2000 Installation CD.
– Windows 2000 is supplied on CD-ROM, and the CD is bootable.
Ensure the boot order in your system BIOS is set to CD-ROM first.
– Windows 2000 will boot automatically if your hard drive is blank; otherwise
you will get a prompt on screen for a few seconds saying Press any key
to boot from CD. At that point, press a key to boot from the CD.
2. Create a system partition (C:).
– Make the Windows 2000 boot partition the same size as the system being
recovered. Remember this information should have already been collected
using msinfo32, found within the [Storage] [Drives] stanzas. The partition
should also be formatted using the same file system, either FAT32 or
NTFS.
– The partition should have the same drive letter. The Windows 2000
operating system folder must be named the same as the system being
recovered (this will usually be WINNT).
3. Set computer name and complete a basic Windows 2000 install.
– Configure the server with the same computer name as the system being
recovered — GALLIUM in our case.
– Place the server into a temporary workgroup (use a workgroup name that
does not exist). Do not make the server a domain member.
– Set the time and date correctly.
– There is no need to install additional services or applications that will not
be used by the recovery process, for example, Terminal Server or
Macintosh services. Installing such items will only increase the amount of
time required to complete the operating system installation and may in fact
add unnecessary complications to the restore process. To speed up the
install time, you can deselect components which are installed by default,
such as Accessories, Internet Information Server, Indexing Service and
Script Debugger.
4. Set TCP/IP address information to the original system settings.
Note: Although it is desirable to make the partition sizes match the original
system, this is not absolutely crucial. As long as there is sufficient space to
restore all the data, it should not affect the success of the recovery.
Tip: Before starting the restore, confirm the consistency of the System Object
backup by running the command QUERY SYSTEMOBJECT from the Tivoli Storage
Manager client command line.
11.If the system is not a Domain Controller, perform a TSM restore of the
Windows System Objects.
– Select the System Object for restore. Do not select individual objects for
restore.
– Continue the restore process.
– At the end of the restore, select to reboot.
12.If the system is a Domain Controller:
– Restart the machine in Directory Recovery mode.
– Restore the Active Directory. (If you wish to do an authoritative restore,
use the NTDSUTIL utility to accomplish this. This is not usually desired in
this scenario).
– Restore any other appropriate system objects, such as the Certificate DB,
SYSVOL, and so on. Which of these you recover will depend on which
services were running on the client prior to its failure.
Note: The event logs are not restored back into the Operating System (that is,
they do not become active). They are restored into the folder
\adsm.sys\eventlog.
To view the logs you should point the event log viewer to the appropriate log
file in this folder.
An overview of applications from Cristie, Ultrabac, and VERITAS for bare metal
recovery is given below. In each case we discuss how TSM can be integrated
with these solutions.
12.2.1 Cristie
Cristie is a provider of data storage and backup solutions, integrated with IBM
Tivoli Storage Manager to provide a Bare Metal Restore (BMR) solution for
Windows users, with Solaris and Linux versions available shortly.
12.2.2 Ultrabac
UltraBac Software integrates its backup and disaster recovery software for
Windows based servers and workstations with IBM Tivoli Storage Manager using
the TSM API. The combination of these products allows customers to take
advantage of UltraBac's client level backup features while maintaining the
enterprise level backup functionality already provided by IBM Tivoli Storage
Manager. Customers will also be able to exploit backup and recovery features
found in UltraBac, such as Image Disaster Recovery.
UltraBac manages the administration of the backup and restore processes, while
Tivoli manages the media that these backups are written to. This allows users
with existing Tivoli Storage Manager installations to utilize these processes for
their backups, as well as providing an avenue for UltraBac users desiring
enterprise-level media management facilities to combine the strengths of
UltraBac and Tivoli offerings. The UltraBac device support is built on top of the
Tivoli Storage Manager API for a robust interface with the Tivoli Storage Manager
architecture, and has been qualified for TSM API levels 4.2 and above.
Further information about Ultrabac and trial downloads can be found at:
http://www.ultrabac.com/
BMR uses the existing TSM server and adds BMR server, boot server and file
server components. The BMR server manages the process to restore a client. It
makes the appropriate boot image and file systems available to the client,
ensures that the boot server and file server are properly configured and
generates a customized client boot script. At restore time, the file server makes
the necessary file systems available to the client via NFS. This includes the file
system that contains the necessary operating system commands and libraries,
the BMR client code and the TSM client code.
The TSM server provides the backup files required to restore the system. In
order to restore the system from TSM, all client files must be backed up to the
TSM server. The standard TSM client is used to restore all files from the TSM
server, including the operating system, applications, configuration files and users
files.
Collecting and recording information about your client systems will greatly
improve your ability to restore your Linux machine to its pre-disaster state after a
disaster.
There are several commands, utilities and features built into Linux that can assist
you with information collection, for example, fdisk, the /proc directory, df,
ifconfig, and the /etc/sysconfig directory. The information that should be
collected for the client system should include:
Hard drive partition information, for example, number and type of partitions,
disk size, mount points, amount of data used per volume, boot partition, and
root directory
System hostname
TCP/IP networking information including IP address, subnet mask, default
gateway, DNS information
Operating System levels
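A minimal shell sketch that gathers this information follows (the output file location is an assumption):

#!/bin/sh
# Collect Linux client configuration for disaster recovery reference
OUT=/root/drinfo.txt
hostname                    >  $OUT    # system hostname
uname -a                    >> $OUT    # operating system level
/sbin/fdisk -l              >> $OUT    # partitions, types, and disk sizes
df -k                       >> $OUT    # mount points and data used per volume
/sbin/ifconfig -a           >> $OUT    # IP address and subnet mask
cat /etc/sysconfig/network  >> $OUT    # hostname and default gateway
cat /etc/resolv.conf        >> $OUT    # DNS information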
Scripts or batch files can be used to automate the collection of client information
for users not skilled in running these kinds of system level commands. Client system
information should then be stored offsite for potential use during a disaster
recovery. Client system information can then be imported into DRM (via scripts
discussed in Chapter 12, “Windows 2000 client bare metal recovery” on
page 267) or DRM administrators can be given access to system information
collected into a text file and backed up by the TSM Backup/Archive client.
The /proc/pci virtual file (shown in Example 13-2) lists the PCI devices installed
on the system. The information provided by this file may help resolve driver and
support issues.
Example 13-5 Using df to output filesystem size, percent used, and mount points
#df

The /etc/sysconfig/network file records the hostname and default gateway
settings, for example:
NETWORKING=yes
HOSTNAME=tonga
GATEWAY=9.1.38.1
Specific instruction for inserting machine information into DRM as well as a more
detailed discussion of the machchar.vbs script are given in Chapter 12,
“Windows 2000 client bare metal recovery” on page 267.
To back up our system files we used an Iomega 250 USB ZIP drive, which was
automatically recognized by the base Red Hat Linux 7.1 install. We used tar to
archive our boot and system files. The system files we backed up are listed in
Table 13-1.
Generally, a good rule of thumb is to back up any directories required at boot time
(a quick scan of the boot initialization file /etc/rc.sysinit will reveal them). We
used tar to package these directories as shown in Example 13-8. Once these
files were generated we copied them to our mounted zip drive (/mnt/zip250.0/).
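Our tar commands would have resembled the following sketch (the exact directory list is an assumption based on what /etc/rc.sysinit references):

#tar -cvf /mnt/zip250.0/etc.tar /etc
#tar -cvf /mnt/zip250.0/boot.tar /boot
#tar -cvf /mnt/zip250.0/bin.tar /bin /sbin
#tar -cvf /mnt/zip250.0/lib.tar /lib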
Full instructions for downloading and creating the tomsrtbt rescue disk can be
found at that URL; note that a bootable rescue CD utility is also available there.
We downloaded tomsrtbt-2.0.103.tar.gz for a Linux/GNU installation.
Table 13-2 Basic Linux Red Hat BMR preparation and restore procedure with TSM
2. Install any removable media drivers required and mount the media device.
3. Use fdisk in the bootable "mini-root" partition to rebuild the partition table
from pre-disaster information.
6. Replace the '/' (root) directory with the temporary /root directory using the
chroot command.
Our environment consists of Linux Red Hat 7.1 Server installed on an Intel PC.
Our TSM server has TSM version 5.1.1 installed and our TSM Backup/Archive
client software was at version 5.1.0. We have a boot/system partition on /dev/hda1 and
6 other Linux partitions configured. To simulate a disaster in our environment, we
deleted the partition table by running the dd command shown in Example 13-10.
Subsequently, we were not able to boot our system.
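A typical form of such a command overwrites the first sector of the disk, destroying the master boot record and partition table; it is shown here only as a sketch and is, of course, destructive:

#dd if=/dev/zero of=/dev/hda bs=512 count=1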
Restore procedure
1. Boot from rescue media or bootable “mini-root” diskette/CD, for example,
tomsrtbt.
With the tomsrtbt rescue diskette in the floppy drive during startup, the system
will boot directly to a root shell with several utilities available.
When the system begins initializing, several prompts appear to specify display
mode, keyboard type, and other settings. In our case all default settings were
suitable. Once this initialization is complete, the prompts shown in
Example 13-11 will appear. You are now in the root directory of the "mini-root"
shell.
2. Install any Removable media device drivers required and mount the media
device.
Since our ZIP drive has a USB host attachment, we required USB add-ons to
be installed in our root shell (similarly, this would be an appropriate time to
install drivers or add-ons for any other device from which you plan to extract
your system files). The appropriate add-ons (usbcore.o, usb-uhci.o) were
installed as shown in Example 13-12.
Example 13-12 Floppy drive mount, driver install, zip drive mount commands
#mount /dev/fd0H1440 /mnt
#insmod /mnt/usbcore.o
#insmod /mnt/usb-uhci.o
#mkdir /mnt/zip
3. Use the fdisk command in the bootable mini-root to rebuild the partition
table (based on fdisk -l partition information gathered before the disaster).
We need to repair (or rebuild) our root disk. The fdisk utility can now be used
to repartition the root disk to its exact format before the disaster. In order for
us to do this we must have the output from the fdisk -l command taken
before the disaster. Luckily we collected this information in Example 13-3.
While fdisk may seem cryptic at first, it is actually quite simple. Essentially
fdisk is used to rebuild the partition table, character by character. You just
need to be careful when following each instruction below.
Example 13-13 Using fdisk to repartition the root disk to pre-disaster state
Original fdisk -l output is given here again for reference.
# fdisk /dev/hda
We must also change the partition type of /dev/hda5 to "Linux Swap" using the
‘t’ option and set the bootable flag on /dev/hda1 with the ‘a’ option.
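A sketch of the interactive sequence follows (partition numbers reflect our layout; the complete dialogue depends on your pre-disaster fdisk -l output):

Command (m for help): n     (repeat for each partition in the fdisk -l listing)
Command (m for help): t
Partition number (1-6): 5
Hex code (type L to list codes): 82
Command (m for help): a
Partition number (1-6): 1
Command (m for help): w     (write the table to disk and exit)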
#mkdir /root
6. Replace the '/' (root) directory with the temporary /root directory using the
chroot command.
We now need to restore the LILO boot block. LILO (LInux LOader) places
itself in a boot sector of your hard disk and installs a boot loader that will be
activated the next time you boot. The chroot command takes two arguments:
the first is a directory name, which becomes the new root directory, that is,
the starting point for searches of pathnames beginning with '/'. The second
argument is a command to run. To run the lilo command without causing
errors, we use chroot followed by the lilo command, as shown in
Example 13-16.
Example 13-16 Replacing the ‘/’ directory and restoring the lilo boot block
# chroot /root /sbin/lilo
#rpm -i TIVsm-API.i386.rpm
#rpm -i TIVsm-BA.i386.rpm
Note that at this point, your dsm.opt and dsm.sys files should be configured to
point to the TSM server with which your client is registered.
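A minimal dsm.sys server stanza sketch follows (the server name and address are assumptions; tonga is our client hostname). The dsm.opt file then simply names this stanza with a matching SErvername line:

SErvername  tsmsrv1
   COMMMethod        TCPip
   TCPPort           1500
   TCPServeraddress  tsmsrv1.ourcompany.com
   NODename          tonga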
9. Perform a TSM restore of all other file systems and data.
You can proceed with restoring the remainder of your system from the TSM
Backup/Archive client. What you restore depends on what applications or
Example 13-18 Performing a command line restore of files systems and data
#cd /opt/tivoli/tsm/client/ba/bin
Note that if you want to restore files from your /opt directory, be sure to exclude
the /opt/tivoli directory to avoid terminating your TSM client session.
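The restore commands in such an example would resemble this sketch (the file system names are assumptions, and /opt is restored selectively per the note above):

#cd /opt/tivoli/tsm/client/ba/bin
#./dsmc restore "/home/*" -subdir=yes -replace=all
#./dsmc restore "/usr/local/*" -subdir=yes -replace=all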
10.Verify that the restore is valid.
– Check event/error logs for issues. In particular check for process and
device driver failure.
– Check that locally defined user and group accounts are present.
– Check that print queues are present and functioning.
– Check security and permissions on files and directories.
– Check that the time zone and system time is correct.
– Ask all users who use the system to check that their profiles have been
restored.
– Test applications for operation.
Typically convert SAN traffic to IP or ATM protocol for site to site data transfer
Maximize IP storage traffic from site to site
Requires dedicated and redundant network connections
Long distance capabilities for storage network traffic
Performance dependent on connection conditions and distance
IP and ATM wide area networks have been in use for a long time, but an
emerging trend is the use of channel extenders to transfer storage traffic to these
WAN environments. Companies such as CNT offer the ability to translate FCP
and IP storage network traffic across long distances using IP or ATM protocols.
The protocol conversion is transparent to a SAN network fabric. For very long site
to site distance requirements, this technology clearly provides a unique
capability. Telecommunications network connections ranging from T-1 to OC-48
can be used with channel extenders. While latency over long distances can be
an issue, availability of network connections is usually not.
For campus environments, native SAN technology provides a robust and easily
managed solution for site to site data transfer. Using long wave GBICs and 9
micron single mode fibre cable, FCP traffic can be routed up to 10 km across
optical fibre SAN connected devices. By using repeaters (switches with long
wave ports) and dedicated fibres, we can achieve distances of about 40 km;
however, access points to the fibre must be available approximately every 10 km,
which is a limiting factor in many cases. Nonetheless, the native SAN solution
provides a scalable and cost effective solution, where cable can be physically run
site to site without easement or access issues.
[Figure: TSM clients and TSM servers with DRM at both sites in a synchronous
write environment]
A mirrored site environment typically staffs both production sites 24x7. The TSM
environment can be automated to a high degree; however, storage administration
staff must be available at both sites at any given time to monitor operations,
troubleshoot problems, and restore a lost TSM environment in the case of a
single site disaster.
[Figure: production TSM server with DRM and a standby TSM server at the
hotsite; each configuration in this chapter is rated for recovery time, complexity,
cost, potential data loss, and scheduled restores]
In the event of a production site disaster, TSM can be quickly restored using
DRM and the mirrored copy of the TSM database. Primary disk pool volumes
and copy storage pool volumes can then be accessed at the recovery site and
hotsite systems can be restored according to the Disaster Recovery Plan.
Figure 14-4 Clustered TSM Servers, with site to site electronic vaulting
[Figure: Option A is asynchronous copy storage pool writes using virtual
volumes over IP from the production TSM server to a standby target TSM server;
Option B is synchronous copy storage pool writes to a SAN-attached tape
library at the remote site]
Figure 14-5 Asynchronous TSM DB mirroring and copy storage pool mirroring
The second option is to write copy storage pool data synchronously to the offsite
tape library using a SAN infrastructure. In the event of a disaster, the TSM DRM
plan file and database copy can be used to restore the TSM server at the
recovery site.
[Figure: the production TSM server sends the TSM recovery logs and DRM plan
file by FTP to a standby TSM server and database at the remote site]
Figure 14-6 Electronic vaulting of TSM DRM, manual DB, and copy pool vaulting
Using a secure WAN network, the DRM output can be sent to the remote site
using FTP services. The TSM database backups can be manually vaulted along
with the copy storage pool data to the warmsite environment for disaster
recovery.
Figure 14-7 Dual production sites, electronic TSM DB and copy pool vaulting
[Figure: critical hardware at the production site connects to the warm site TSM
server over WAN/IP and SAN/FCP links using DWDM and dark fibre; the TSM
database is also manually vaulted offsite]
Figure 14-8 Remote TSM Server at warm site, plus manual offsite vaulting
In this setup, the Bank can restore any application server in the event of a local
disaster such as a hard drive crash or the loss of an individual server. In the
event of the loss of the TSM server, backups of the database and configuration
files have been prepared for Disaster Recovery, but the estimated recovery time
for the TSM server was too long.
Problem
The Bank asked IBM to help develop a Disaster Recovery solution for the
decentralized servers. An RTO of 4 hours was established for the most important
application servers. The Bank also decided to start developing a new branch
office solution based on a Linux platform, using a DB2 database and WebSphere.
The backup, archive, and DR solution has to cover all new servers and
applications.
Solution
IBM analyzed the current situation in the Bank and suggested implementing a
solution with the following steps:
Equip the IBM TotalStorage Enterprise Tape Library in the backup location
with additional tape drives for the TSM server, and connect the TSM server
and the drives using the existing SAN connection between the two locations.
Define this second library on the TSM server and create copy storage pools
on the second library in the remote location.
Back up the primary storage pools from the tape library in the primary location
to copy storage pools on the tape library in the backup location, as sketched below.
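A sketch of the corresponding TSM administrative commands follows (the library, device class, and storage pool names are assumptions, and the drive and path definitions are omitted):

define library remotelib libtype=scsi
define devclass remote_lto devtype=lto library=remotelib
define stgpool copypool_remote remote_lto pooltype=copy maxscratch=100
backup stgpool tapepool copypool_remote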
These steps will increase the availability of the TSM server and data, and
implement prerequisites for the Disaster Recovery solution for existing servers.
See Figure 14-10 for an illustration of how protection for the TSM server has
been improved.
In this situation, the Bank can now perform a Disaster Recovery for the TSM
server, providing another server is available to replace it. The next series of
recommended steps should solve all the remaining requirements.
Buy a second TSM server for the backup location, and add new tape drives
for the second TSM server to both libraries.
Define disk space on the ESS for second TSM server, and mirror this disk
space using PPRC to the other (primary) location.
Geographically disperse clustered servers.
The Bank’s main IT operations center includes more than 150 open system
servers, based on Windows, Solaris, AIX, and HP platforms. A team of six
administrators was using a variety of local tape drives and media for data and
operating system backups. Network based backups were done using three
different storage management applications, which tended to consume large
amounts of tape and network resources. These backup products also
experienced problems with restoring data in a reasonable time scale. Nobody
was able to guarantee an RTO in the case of a major disaster caused by volcanic
eruption or a tsunami. The administrators' skills were sufficient to perform a bare
metal restore in the case of a single server crash, but very little enterprise policy
existed for disaster recovery. The data backup policy was decentralized and no
enterprise standards for backup, retention, or archive existed.
Problem
In this situation the Bank asked an IBM Business Partner to develop a solution to
provide data consolidation, centralized backup, and increased disaster recovery
capability. The disaster recovery RTO requirement for critical servers was a 2-hour time
frame, but the existing infrastructure barely provided restore capabilities within a
12 hour window. Furthermore, enterprise storage growth estimates were 30%
per year, and no current strategy or product provided a manageable solution.
Current software license agreements for backup software were soon to expire,
and a critical decision for enterprise storage management strategy had to be
made.
On a business level, the Bank required the development of a hot site data center
within 10 km of the main production facility to provide DR capabilities for the
enterprise. Rigorous regulatory requirements mandated that critical data must be
continuously available and recoverable.
Solution
In the first phase, the IBM Business Partner analyzed the banking IT environment
and daily processing to estimate the total amount of data stored, daily change
rate of data, and the overall volume of data storage projected for a 3 year time
frame. The bank had approximately 150 TB of data, of which around 5 TB
changed daily. Approximately 50% of the data resided in databases.
Since the majority of the Bank’s clients were located in the Pacific rim, the Bank
The IBM Business Partner designed a solution where data was consolidated on
IBM TotalStorage ESS disk arrays, with a backup and disaster recovery solution
based on TSM servers and two IBM TotalStorage Enterprise Tape Libraries in
each of the primary and backup locations. The IBM Business Partner also
recommended that the Bank move some critical servers to the backup location,
thereby decreasing the impact and the recoveries required in the case of a
local disaster.
We understand that we have packed many DR and TSM concepts into this guide.
As such we do not want to reiterate all those details here. We refer you to the rest
of this redbook for that. Instead we provide you with a description that lists the
components that should be considered for DR and TSM. Figure 14-12 presents
this summary.
DR Testing
and Maintenance
1. Develop a test schedule with pre-defined test scenarios.
2. Fully document each activity during the recovery test.
3. Review the results of each test and initiate necessary corrections.
Detailed discussion of these products is beyond the scope of this book, however
we recommend the following Redbooks for more details on TSM’s capabilities for
application and database backup.
Backing Up Oracle Using Tivoli Storage Management, SG24-6249
Backing Up DB2 Using Tivoli Storage Manager, SG24-6247
Backing Up Lotus Domino R5 Using Tivoli Storage Management, SG24-5247
Using Tivoli Data Protection for Microsoft Exchange Server, SG24-6147
R/3 Data Management Techniques Using Tivoli Storage Manager,
SG24-5743
Using Tivoli Data Protection for Microsoft SQL Server, SG24-6148
Using Tivoli Storage Manager to Back Up Lotus Notes, SG24-4534
Backing up WebSphere Application Server Using Tivoli Storage
Management, REDP0149
Part 3 Appendixes
A.1.1 Purpose
This {system name} Disaster Recovery Plan establishes procedures to recover
the {system name} following a disruption. The following objectives have been
established for this plan:
Maximize the effectiveness of contingency operations through an established
plan that consists of the following phases:
– Notification/Activation phase to detect and assess damage and to activate
the plan
– Recovery phase to restore temporary IT operations and recover damage
done to the original system
– Reconstitution phase to restore IT system processing capabilities to
normal operations.
Identify the activities, resources, and procedures needed to carry out {system
name} processing requirements during prolonged interruptions to normal
operations.
Assign responsibilities to designated {Organization name} personnel and
provide guidance for recovering {system name} during prolonged periods of
interruption to normal operations.
Ensure coordination with other {Organization name} staff who will participate
in the Disaster Recovery Planning strategies. Ensure coordination with
external points of contact and vendors who will participate in the Disaster
Recovery Planning strategies.
A.1.2 Applicability
The {system name} Disaster Recovery Plan applies to the functions, operations,
and resources necessary to restore and resume {Organization name}’s {system
name} operations as it is installed at {primary location name, City, State}.
The {system name} Disaster Recovery Plan applies to {Organization name} and
all other persons associated with {system name} as identified under A.2.3,
“Responsibilities” on page 337.
A.1.3 Scope
The plan scope outlines the planning principles, assumptions, policy references,
and a record of changes.
A.1.3.2 Assumptions
Based on these principles, the following assumptions were used when
developing the IT Disaster Recovery Plan:
The {system name} is inoperable at the {Organization name} computer center
and cannot be recovered within 48 hours.
Key {system name} personnel have been identified and trained in their
emergency response and recovery roles; they are available to activate the
{system name} Disaster Recovery Plan.
Preventive controls (for example, generators, environmental controls,
waterproof tarps, sprinkler systems, fire extinguishers, and fire department
assistance) are fully operational at the time of the disaster.
The {system name} Disaster Recovery Plan does not apply to the following
situations:
Overall recovery and continuity of business operations. The Business
Resumption Plan (BRP) and Continuity of Operations Plan (COOP) are
appended to the plan.
Emergency evacuation of personnel. The Occupant Evacuation Plan (OEP) is
appended to the plan.
Any additional constraints should be added to this list.
A.1.4 References/requirements
This {system name} Disaster Recovery Plan complies with the {Organization
name}’s IT DR Planning policy as follows:
A.2.3 Responsibilities
The following teams have been developed and trained to respond to a
contingency event affecting the IT system.
The relationships of the team leaders involved in system recovery and their
member teams are illustrated in the figure.
Insert hierarchical diagram of recovery teams. Show team names and leaders;
do not include actual names of personnel.
Describe each team separately, highlighting overall recovery goals and specific
responsibilities. Do not detail the procedures that will be used to execute these
responsibilities. These procedures will be itemized in the appropriate phase
sections.
Upon notification from the DR Planning coordinator, Team Leaders are to notify
their respective teams. Team members are to be informed of all applicable
information and prepared to respond and relocate if necessary.
The following procedures are for recovering the {system name} at the alternate
site. Procedures are outlined per team required. Each procedure should be
executed in the sequence it is presented to maintain efficient operations.
External System Contacts
    {Identify the individuals, positions, or offices outside your
    organization that depend on or support the system; also specify
    their relationship to the system}
Responsibilities
Hardware Resources
Software Resources
Other Resources
Using this script, you could build your own report of just the subcategories you
want, for example:
Cscript msinfoextract.vbs msinfo32.rpt system summary > msinfo32x.rpt
Cscript msinfoextract.vbs msinfo32.rpt drives >> msinfo32x.rpt
Cscript msinfoextract.vbs msinfo32.rpt adapter >> msinfo32x.rpt
An alternative approach might be to write a script that eliminates the stanzas you
specify.
end PLANFILE.DESCRIPTION
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin PLANFILE.TABLE.OF.CONTENTS
PLANFILE.DESCRIPTION
PLANFILE.TABLE.OF.CONTENTS
end PLANFILE.TABLE.OF.CONTENTS
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin SERVER.REQUIREMENTS
end SERVER.REQUIREMENTS
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin RECOVERY.INSTRUCTIONS.GENERAL
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin RECOVERY.INSTRUCTIONS.OFFSITE
The offsite vault is IronVault, Fort Knox, Kentucky. Ph 800 499 3999 - 24-hour guaranteed priority
response line.
The courier company is Fast Leg Courier Service, Ph 877 838 4500.
Make sure the list of volumes required for recovery is ready for faxing (800 499 9333) or
e-mailing (emergency@ironvault.com) to the vaulting service.
end RECOVERY.INSTRUCTIONS.OFFSITE
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin RECOVERY.INSTRUCTIONS.INSTALL
TSM server requires Intel server machine from PC support group. Minimum 512MB RAM, 12 GB of
disk, CD-ROM drive and Ethernet card
Install Windows 2000 and Service Pack 2. TCP/IP address is radon.ourcompany.com, 192.1.5.1,
subnet mask 255.255.255.0, router, 192.1.5.254.
Install LTO drivers from http://index.storsys.ibm.com
Install TSM server v 5.1 from install CD
Install TSM server update v 5.1.1.0
end RECOVERY.INSTRUCTIONS.INSTALL
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin RECOVERY.INSTRUCTIONS.DATABASE
end RECOVERY.INSTRUCTIONS.DATABASE
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin RECOVERY.INSTRUCTIONS.STGPOOL
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin RECOVERY.VOLUMES.REQUIRED
end RECOVERY.VOLUMES.REQUIRED
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin RECOVERY.DEVICES.REQUIRED
end RECOVERY.DEVICES.REQUIRED
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE script
@echo off
rem Purpose: This script contains the steps required to recover the server
rem to the point where client restore requests can be satisfied
rem directly from available copy storage pool volumes.
rem Note: This script assumes that all volumes necessary for the restore have
rem been retrieved from the vault and are available. This script assumes
rem the recovery environment is compatible (essentially the same) as the
rem original. Any deviations require modification to this script and the
rem macros and scripts it runs. Alternatively, you can use this script
rem as a guide, and manually execute each step.
rem Restore the server database to latest version backed up per the
rem volume history file.
"C:\PROGRAM FILES\TIVOLI\TSM\SERVER\DSMSERV" -k "Server1" restore db todate=07/26/2002
totime=17:23:38 source=dbb
rem Tell the Server these copy storage pool volumes are available for use.
rem Recovery Administrator: Remove from macro any volumes not obtained from vault.
dsmadmc -id=%1 -pass=%2 -ITEMCOMMIT
-OUTFILE="C:\DRM\PLAN\RADON.COPYSTGPOOL.VOLUMES.AVAILABLE.LOG" macro
"C:\DRM\PLAN\RADON.COPYSTGPOOL.VOLUMES.AVAILABLE.MAC"
rem Volumes in this macro were not marked as 'offsite' at the time
rem PREPARE ran. They were likely destroyed in the disaster.
rem Recovery Administrator: Remove from macro any volumes not destroyed.
dsmadmc -id=%1 -pass=%2 -ITEMCOMMIT
-OUTFILE="C:\DRM\PLAN\RADON.COPYSTGPOOL.VOLUMES.DESTROYED.LOG" macro
"C:\DRM\PLAN\RADON.COPYSTGPOOL.VOLUMES.DESTROYED.MAC"
:end
end RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE script
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin RECOVERY.SCRIPT.NORMAL.MODE script
@echo off
rem Purpose: This script contains the steps required to recover the server
rem primary storage pools. This mode allows you to return the
rem copy storage pool volumes to the vault and to run the
rem server as normal.
rem Note: This script assumes that all volumes necessary for the restore
rem Create replacement volumes for primary storage pools that use device
rem class DISK.
rem Recovery administrator: Edit script for your replacement volumes.
call "C:\DRM\PLAN\RADON.PRIMARY.VOLUMES.REPLACEMENT.CREATE.CMD" 1>
"C:\DRM\PLAN\RADON.PRIMARY.VOLUMES.REPLACEMENT.CREATE.LOG" 2>&1
type "C:\DRM\PLAN\RADON.PRIMARY.VOLUMES.REPLACEMENT.CREATE.LOG"
rem Restore the primary storage pools from the copy storage pools.
dsmadmc -id=%1 -pass=%2 -ITEMCOMMIT -OUTFILE="C:\DRM\PLAN\RADON.STGPOOLS.RESTORE.LOG" macro
"C:\DRM\PLAN\RADON.STGPOOLS.RESTORE.MAC"
:end
end RECOVERY.SCRIPT.NORMAL.MODE script
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin LOG.VOLUMES
"C:\TSMDB\TSMLOG01.DB" 128
end LOG.VOLUMES
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin DB.VOLUMES
end DB.VOLUMES
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
/* Purpose: Mark copy storage pool volumes as available for use in recovery. */
/* Recovery Administrator: Remove any volumes that have not been obtained */
/* from the vault or are not available for any reason. */
/* Note: It is possible to use the mass update capability of the server */
/* UPDATE command instead of issuing an update for each volume. However, */
/* the 'update by volume' technique used here allows you to select */
/* a subset of volumes to be processed. */
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
@echo off
rem Purpose: Create replacement volumes for primary storage pools that
rem use device class DISK.
rem Recovery administrator: Edit this section for your replacement
rem volume names. New name must be unique, i.e. different from any
rem original or other new name.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
/* Purpose: Restore the primary storage pools from copy storage pool(s). */
/* Recovery Administrator: Delete entries for any primary storage pools */
/* that you do not want to restore. */
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin VOLUME.HISTORY.FILE
*************************************************************************
*
* Sequential Volume Usage History
* Updated 07/26/2002 18:30:58
*
* Operation            Volume      Backup  Backup  Volume  Device   Volume
* Date/Time            Type        Series  Oper.   Seq     Class    Name
*************************************************************************
2002/07/08 18:01:49    STGNEW           0       0       0  CLASS1   IBM001
2002/07/09 17:42:32    STGNEW           0       0       0  CLASS1   IBM002
2002/07/09 17:43:55    STGNEW           0       0       0  CLASS1   IBM003
2002/07/10 10:33:36    STGNEW           0       0       0  CLASS1   IBM004
2002/07/24 17:06:17    STGDELETE        0       0       0  CLASS1   IBM001
2002/07/24 17:06:44    STGDELETE        0       0       0  CLASS1   IBM003
2002/07/24 17:28:41    STGNEW           0       0       0  CLASS1   ABA920L1
2002/07/24 18:04:52    STGDELETE        0       0       0  CLASS1   IBM004
2002/07/24 18:17:53    STGNEW           0       0       0  CLASS1   ABA922L1
2002/07/24 18:20:02    STGDELETE        0       0       0  CLASS1   IBM002
2002/07/24 18:32:39    STGNEW           0       0       0  CLASS1   ABA924L1
* Location for volume ABA925L1 is: 'Ironvault, Fort Knox, Kentucky'
2002/07/24 19:20:33    BACKUPFULL       6       0       1  CLASS1   "ABA925L1"
2002/07/26 15:49:42    STGNEW           0       0       0  CLASS1   "ABA926L1"
* Location for volume ABA927L1 is: 'IronVault, Fort Knox, Kentucky'
2002/07/26 17:23:38    BACKUPFULL       7       0       1  CLASS1   "ABA927L1"
end VOLUME.HISTORY.FILE
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin DEVICE.CONFIGURATION.FILE
end DEVICE.CONFIGURATION.FILE
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin DSMSERV.OPT.FILE
* ====================================================================
* Tivoli Storage Manager
* Server Options File - Version 4, Release 2, Level 0
* 5639-A09 (C) Copyright IBM Corporation, 1990, 2001,
* All Rights Reserved.
* ====================================================================
*
* Tivoli Storage Manager (TSM):
* Server Options File (dsmserv.opt)
* Platform: Windows NT
*
* Note -- This file was generated by the TSM Options File Editor.
*
* =====================================================================
*
* HTTP
*
* ********************************************************************
* HTTPport
*
* Specifies the HTTP port address of a TSM Web interface.
*
* =====================================================================
*
* DEVCONFIG
* ********************************************************************
* DEVCONFig <filename>
*
* Specifies the name of a file that should contain device
* configuration information when it is changed by the server.
* Device configuration information is used by the
* server processes during server database recovery or load and
* DSMSERV DUMPDB processing.
*
* More than one of these parameters may be specified to record
* device configuration information to multiple files.
*
* Syntax
* +------------------+----------------------------------------------+
* | DEVCONFig | filename |
* +------------------+----------------------------------------------+
*
* DEVCONFig "devcnfg.out"
* The previous line was replaced by PREPARE to provide a fully qualified
* file name.
DEVCONF "C:\PROGRA~1\TIVOLI\TSM\SERVER1\DEVCNFG.OUT"
*
* =====================================================================
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin LICENSE.INFORMATION
end LICENSE.INFORMATION
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
end MACHINE.GENERAL.INFORMATION
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin MACHINE.RECOVERY.INSTRUCTIONS
end MACHINE.RECOVERY.INSTRUCTIONS
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin MACHINE.CHARACTERISTICS
Item Value
OS Name Microsoft Windows 2000 Server
Version 5.0.2195 Service Pack 2 Build 2195
OS Manufacturer Microsoft Corporation
System Name GALLIUM
System Manufacturer IBM
System Model eserver xSeries 330 -[867411X]-
System Type X86 - based PC
Processor x86 Family 6 Model 11 Stepping 1 Genuine Intel ~1128 Mhz
Processor x86 Family 6 Model 11 Stepping 1 Genuine Intel ~1128 Mhz
BIOS Version IBM BIOS Ver 0.0
Windows Directory C:\WINNT
System Directory C:\WINNT\System32
Boot Device \Device\Harddisk0\Partition 1
Locale United States
User Name GALLIUM\Administrator
Time Zone Pacific Daylight Time
Total Physical Memory 3,866,068KB
Available Physical Memory 3,558,968KB
Total Virtual Memory 9,660,368KB
Available Virtual Memory 9,196,060KB
Page File Space 5,794,300KB
Page File C:\pagefile.sys
[Hardware Resources]
end MACHINE.CHARACTERISTICS
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
begin MACHINE.RECOVERY.MEDIA.REQUIRED
end MACHINE.RECOVERY.MEDIA.REQUIRED
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
Index

high bandwidth connections 29, 31, 35
HIPAA 38
hotsite 10, 24–26, 32, 34, 64, 79, 90–92, 154, 180, 183, 313
hot-swappable devices 84, 86
HVAC 82

I
IBM Business Continuity and Recovery Services 93
IBM Network Storage Manager 164
IBM SAN Data Gateway 129, 192
IBM Tivoli Storage Manager. See TSM
incremental backups 17
indirect losses 7
Informix 18
infrastructure planning 85, 95
infrastructure redundancy 81
insourcing 60
instant copy 130
intelligent disk subsystem 18
IP extenders 89
iSCSI 156
ISV 19
IT outages 5

K
Kerberos 18

L
LAN 41
LAN performance 111
Linux
   /etc/fstab 293
   /etc/rc.sysinit 295
   /etc/sysconfig 290
   /etc/sysconfig/network 293
   /proc 290–291
   bare metal recovery 289–290, 295
   LILO 304
   mini-root 296, 299
   recovery diskette 296
   Red Hat 295
   system configuration 290
   system partition information 292
Linux commands
   chroot 297, 304
   dd 298
   df 290, 293
   fdisk 290, 292–293, 300
   ifconfig 290, 293
   tomsrtbt 296
logical partition (LPAR) 39
logical volume 106, 133
Logical Volume Storage Agent 134
Lotus Domino 18
Lotus Notes 18
LTO 171

M
MAN 155
Metropolitan Area Network (MAN) 43
Microsoft Exchange 18
Microsoft Project 114
Microsoft SQL Server 18
MIGDELAY 178
mission-critical 9, 51, 84, 113, 170
mkisofs 246
MSCS 35, 84, 144, 148, 150, 188
MTBF 83

N
NDMP 132
NetView 89
network appliances 132
network architecture 87
network attached storage (NAS) 41, 131, 156, 169
network bandwidth 143
network boot 228
network failover planning 11, 47, 88
Network Installation Management (NIM) 240, 248
network monitoring 89
Network Recovery Objective (NRO) 11, 52, 58
network redundancy 88, 112
network security 90
network topologies 111
network transfer rates 42
NFS 251
NTFS 129, 175

O
Occupant Evacuation Plan 335
OEP 335
offsite data storage 154, 170
offsite storage 23–24, 26, 32, 62, 98, 120, 135, 153

ufsrestore 229–230, 234, 236
SPARC 230
split-mirror backup 126
SRDF 31, 34–35, 87, 106, 152, 189
SSA 169
standards 85
Storage Area Network (SAN) 41
storage growth 39, 164, 167
storage management 164
storage management challenges 39, 164, 171
storage management planning 48, 95
Storage Service Provider 26, 92, 319
strategic planning 80, 104, 109
Sunguard 93
supporting infrastructure 49
switches 87
synchronous mirroring 152
SysBack 240, 264
system dependencies 49
system software 85

T
tape automation 170
tape cartridge volume 109
tape device driver 211
tape device failover 189
tape drive dual attach 189
tape drive performance 107–108, 169
tape header information 110
tape vaulting 62, 153
tar 295
target server 140
TCP/IP 41, 141–142
TCPSERVERADDRESS 182
TDP 17, 126
TDP for NDMP 132
technology density 81
technology selection 80
telecommunications 93
TFTP 251, 256
The Kernel Group 286
TimeFinder 18, 87, 126
Tivoli Data Protection. See TDP
Tivoli Disaster Recovery Manager. See DRM under TSM
Tivoli Storage Management 4
Tivoli Storage Manager. See TSM
TKG 286
training and testing 47
transactions 65
TSM 68, 80, 84, 92, 164, 228, 230, 241, 248, 268
   adaptive subfile backup 16, 101, 124–125
   administration 14, 166, 193, 272
   ADSMSCSI device driver 211
   and HACMP 144–145
   and MSCS 148, 150
   API 18–19, 170, 285
   architecture 98
   archive 12, 17, 33, 100, 127, 142
   archive copy group 141
   backup 17, 100–101, 117
   backup and restore 12
   backup methods 124
   backup set 101, 127
   backup window 170, 172, 187
   Backup/Archive client 13, 269, 295
   backup-centric planning 96
   caching 178
   client backup methods 124
   client configuration information 135, 137–138
   client data 113, 141
   client interface 15
   client option set 139
   client options file 104, 147, 184, 186–187, 247, 278, 304
   client policy 100
   client Web browser interface 16
   clients 100
   clustering applications 144, 188
   CLUSTERNODE 147, 151
   code base 164
   collocation 98, 102, 109, 156, 159, 179
   command routing 138, 140
   command-line interface 14, 140, 193
   COMMRESTARTDURATION 146
   COMMRESTARTINTERVAL 146
   configuration files 177, 181, 187
   configuration manager 139–140
   configure DRM 195
   copy groups 100–101, 103, 231
   copy storage pool 13, 31, 98, 102, 112, 120–121, 142, 156–158, 170, 172, 180–181, 189, 193, 248
   CPU power 187
   CPU utilization 124
   data flow 98
   data movement 13
   remote disk replication 151
   remote server 141
   resource contention 187
   resource utilization 114
   restore priorities 190
   restore time 172, 179
   retention policies 111, 170
   REUSEDELAY 194, 197
   ROLLFORWARD 174, 176
   rollforward recovery log 123, 174
   SAN exploitation 172
   scalability 164, 167
   scheduled operations 146, 150, 188
   Scheduler Service 147, 151
   scheduling 113, 191
   SCSI reserve 150
   security 18
   selective backup 125
   server 13, 117
   server consolidation 142
   server group 140
   server options 121, 123, 177, 187
   server policy 101
   server rebuild 166, 190
   server recovery 121, 135, 191, 207, 215
   server restart 189
   server sizing 113
   server-free backup 102, 124, 129–130, 167, 172
   server-to-server communication 31–32, 138–139, 177–178, 191, 198, 207, 316
   SHRSTATIC 231
   snapshot database backup 122, 175–177, 197
   snapshot image backup 134
   source server 140–141, 143
   split-mirror backup 126, 130
   SQL interface 173
   Storage Agent 128
   storage hierarchy 178, 189
   storage pool backup 121, 141, 153, 191, 193, 200
   storage pool hierarchy 13, 120, 180
   storage pool sizing 110
   storage pools 120, 156, 178, 189, 248
   synchronous storage pool 102
   tape device class 107
   tape device failover 189
   tape library 187
   tape library sharing 167
   tape library sizing 109
   tape storage pool 107, 110, 178–179
   target server 141, 143
   TCPSERVERADDRESS 184, 186
   throughput 167
   transaction 117, 173
   UNIX client 231
   vaulting 92, 153
   versioning 100, 110
   versions 110, 125
   virtual volumes 31, 33, 121, 138, 140, 142, 177–178, 191
   volume history 31, 122–123, 153, 177, 181, 183, 187, 211
   volume management 93, 169
   volume tracking 153, 190, 199
   Web browser interface 14, 140
   Windows 2000 APIs 268
TSM commands
   AUDIT LIBRARY 218
   BACKUP DB 201
   BACKUP DEVCONFIG 123
   BACKUP STGPOOL 102, 200
   BACKUP SYSTEMOBJECT 277
   BACKUP VOLHISTORY 123
   CHECKIN LIBVOL 207
   COPY STGPOOL 121
   DEFINE DEVCLASS 141
   DEFINE MACHINE 272
   DEFINE STGPOOL 193
   DISMOUNT VOLUME 202
   dsmfmt 212
   EXPORT 141, 167
   IMPORT 141, 167
   INSERT MACHINE 272, 294, 348
   MOVE DATA 110
   MOVE DRMEDIA 196–197, 199, 203–204, 207
   MOVE MEDIA 196
   MOVE NODEDATA 110
   PREPARE 195, 198, 206, 275, 294
   QUERY DRMEDIA 196–197, 202, 204, 207
   QUERY DRMSTATUS 198
   QUERY LICENSE 193
   QUERY MACHINE 274
   QUERY MOUNT 202
   QUERY RPFILE 143
   QUERY STGPOOL 194
   QUERY SYSTEMOBJECT 279, 284
   REGISTER LICENSE 193

V
vaulting 24, 32, 61
VERITAS 19, 228, 286
virtualization 169
vital record retention, archive and retrieval 12

W
WAN 90
warmsite 10, 79, 90–92, 180–181, 315, 317
WebSphere Application Server 18
Windows 2000
   Active Directory 268, 284
   APIs 268, 276
   bare metal recovery 267
   configuration information 268, 272
   DFS 268
   Disk Management 268