Published PAS77
PAS 77:2006
ICS code: 35.020 NO COPYING WITHOUT BSI PERMISSION EXCEPT AS PERMITTED BY COPYRIGHT LAW
Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 Abbreviations
4 IT Service Continuity management
5 IT Service Continuity strategy
6 Understanding risks and impacts within your organization
7 Conducting business criticality and risk assessments
8 IT Service Continuity plan
9 Rehearsing an IT Service Continuity plan
10 Solutions architecture and design considerations
11 Buying Continuity Services
Annex A (informative) Conducting business criticality and risk assessments
Annex B (informative) IT Architecture Considerations
Annex C (informative) Virtualization
Annex D (informative) Types of site models
Annex E (informative) High availability
Annex F (informative) Types of resilience
Bibliography
Foreword
This Publicly Available Specification (PAS) has been prepared by the British Standards Institution (BSI) in partnership with Adam Continuity, Dell Corporation, Unisys, and SunGard.
Acknowledgement is given to the following organizations that have been involved in the development of this code of practice:
Adam Continuity
Dell Corporation
SunGard
Unisys
Contributors:
Oscar O'Connor, Lead Author
John Pollard
Richard Pursey
Andrew Roles
Brian Hayden
Douglas Craig
Stafford Hunt
As a code of practice, this PAS takes the form of guidance and recommendations. It should not be quoted as if it is a specification and particular care should be taken to ensure that claims of compliance are not misleading.
This Publicly Available Specification has been prepared and published by BSI, which retains its ownership and copyright. BSI reserves the right to withdraw or amend this Publicly Available Specification on receipt of authoritative advice that it is appropriate to do so. This Publicly Available Specification will be reviewed at intervals not exceeding two years, and any amendments arising from the review will be published as an amended Publicly Available Specification and publicized in Update Standards.
This Publicly Available Specification is not to be regarded as a British Standard.
This Publicly Available Specification does not purport to include all the necessary provisions of a contract. Users are responsible for its correct application.
Compliance with this Publicly Available Specification does not of itself confer immunity from legal obligations.
Attention is drawn to the following statutory instruments and regulations:
Basel II: International Convergence of Capital Measurement and Capital Standards: a Revised Framework, Basel. Bank for International Settlements Press and Communications, 2005.
The Civil Contingencies Act 2004. Cabinet Office: The Stationery Office.
The Data Protection Act 1998. British Parliament: The Stationery Office.
The Higgs Report on the Role of Non-Executive Directors: Department of Trade and Industry: The Stationery Office, 2001.
The Sarbanes-Oxley Act, 107th Congress of the United States of America, 2002.
The Turnbull Report on Corporate Governance: Department of Trade and Industry: The Stationery Office, 1998.
The Orange Book Management of Risk Principles and Concepts: HM Treasury, 2004.
Introduction
This code of practice provides guidance on IT Service Continuity Management (ITSCM). It is intended to complement, rather than replace or supersede, other publications such as PAS 56, BS ISO/IEC 20000, BS ISO/IEC 17799:2005 and ISO 9001 (see Bibliography for further information):
PAS 56 provides guidance on best practice in Business Continuity Management, and while it mentions the need for IT Service Continuity it does not provide the detailed guidelines found in this code of practice;
BS ISO/IEC 20000 provides guidance on best practice in Service Management and, like PAS 56, mentions IT Service Continuity, but not at the level of detail presented in this code of practice;
BS ISO/IEC 17799:2005 provides detailed guidance on best practice in information security management, which is one aspect of IT Service Continuity Management. This code of practice does not directly address information security or physical and environmental security as these areas are covered by BS ISO/IEC 17799:2005;
ISO 9001 provides guidance on best practice in Quality Management Systems. When implementing any recommendations found within this code of practice, the reader is encouraged to apply the quality assurance and control recommendations found in ISO 9001.
Many organizations believe that a loss of systems infrastructure will not happen to them or that a loss of such infrastructure will have a relatively low impact. However, while many of those organizations might believe that they have invested in adequate systems resilience, it is often apparent that such confidence is misplaced.
In an age in which information technology is becoming ever more pervasive and increasingly critical within the day-to-day operations of many organizations, it is clear that the ability to continue to operate with any degree of success is likely to be severely compromised following loss of IT services. In addition, it is evident that the duration of a tolerable IT outage is becoming ever shorter.
As Figure 1 suggests, there is a continuous cycle in the relationships between several important documents. The IT Strategy defines the organization's key policies and direction regarding information technology, systems and services. From this, the IT Service Continuity Strategy can be defined to ensure that the policies and standards for IT Service Continuity directly and explicitly support the objectives set out in the IT Strategy. This then enables the organization to define its IT Architecture based upon the requirements and objectives set out in both the IT Strategy and ITSC Strategy. Once the architecture is defined, the organization can then define IT Service Continuity Plans for each element of the architecture. Feedback from (amongst many other sources) rehearsing the ITSC Plans can subsequently be used as input to the next iteration of the IT Strategy.
Figure 1 The relationship between the IT Strategy, ITSC Strategy, IT Architecture and ITSC Plan (figure not reproduced)
Whilst it is true that major events such as bombs, fires and floods make headline news, the majority of IT related incidents fall into the category of "quiet calamities" that only affect an individual or a small subset of the organization. Examples of such common incidents include the theft of a mobile worker's notebook computer, the failure of an important business application and corruption of important or confidential data. These incidents have the potential to damage an organization's brand or public image and its reputation, not to mention its revenues and customer service. Such damage has the potential to destroy that organization unless appropriate action is taken to implement IT Service Continuity (ITSC).
In order to retain an appropriate sense of perspective, this document refers to "incidents" and "events" rather than "disasters". Since the Asian Tsunami of 2004 and Hurricane Katrina in 2005, the phrase "disaster recovery" has taken on dimensions previously unknown and the authors felt it was inappropriate to describe the failure of IT systems, however disruptive, using the same language.
Throughout the document the reader may encounter terminology which is used in other standards. To avoid ambiguity the reader should refer to the definition section to understand how such terminology is used in this document, which may differ from other standards.
This document is intended to be read by a number of different audiences:
Executive and Senior Management, to gain a high level understanding of the fundamental interdependencies between Corporate Governance, Business Continuity and IT Service Continuity in order to make better-informed investment decisions relating to ITSCM;
Middle Management, to understand how decisions should be made regarding IT Service Continuity such that critical business processes survive disruption (ideally) or at the very least have the ability to recover from disruption in timescales required by the organization;
IT Management, to understand the decision making processes required in order to ensure that IT Service Continuity strategies and plans fully support business priorities;
IT Support and Operations, to gain a practical insight into how IT Service Continuity strategies should be drawn up and implemented in such a way as to add value to the organization as well as protecting it from IT-related incidents;
Regulators, auditors, insurance and benchmarking organizations, to understand what best practice in IT Service Continuity Management implies for organizations so that these measures can be assessed as part of wider reviews of Corporate Governance and resilience.
This code of practice is designed for organizations of all shapes and sizes, whether in the private or public sectors.
It should not be regarded as a step-by-step guide to implementing IT Service Continuity Management but as guidance on the aspects of ITSCM which organizations should consider when investing in this area. Not all activities described herein will be applicable or appropriate for all organizations. In particular, small organizations should aim to use this code of practice as a reference guide in order to help them make informed decisions about what level of ITSCM would be appropriate for them given their individual characteristics.
Throughout this code of practice certain terms have been used which may cause confusion. Such confusion is naturally not the intention of the authors, so the following guidance should be borne in mind when reading this document:
The term "business" is used when referring to the non-IT elements of an organization. This should not be taken to imply that this code of practice is aimed purely at private sector or commercial bodies. In each such instance, the term is used merely as convenient shorthand to avoid over-complicating the language used herein.
This code of practice refers to "rehearsing" ITSC Plans. Other publications in this field have referred to "testing" and also to "exercising". The authors regard these terms as largely interchangeable and have opted to use the term "rehearsing" in this context as it implies not only testing that ITSC Plans are accurate and capable of being implemented, but also that the people required to implement them are guided, supported and provided with feedback on their own personal performance as well as that of the Plans. The authors did not feel that either of the other terms used in other publications quite conveyed the necessary emphasis on this aspect.
The term "data centre" is used to imply any location or facility where core information technology services are housed, whether that be the ultra-modern data centres that major organizations use or under the desk where a one-person business keeps its file server. No inference should be drawn regarding the applicability of guidance or recommendations to any type of non-data centre environment.
1 Scope
This Publicly Available Specification (PAS) explains the principles and some recommended techniques for IT Service Continuity management. It is intended for use by persons responsible for implementing, delivering and managing IT Service Continuity within an organization.
This PAS provides a generic framework and guidelines for a continuity programme including the following topics:
What the required management structure, roles and responsibilities for implementing IT Service Continuity management are.
How business criticality, risk assessments and business impact assessments should be performed to produce useable results.
What business continuity plans contain and the steps required to respond to, and recover from, the identified risks within the context of specified business processes.
How the development, rehearsal and deployment of the Business Continuity plan does not have to cost more in terms of money, risk or reputation than taking no action.
Why a framework and capability should be developed for the organization to respond effectively to unexpected disruption.
This document is not intended to be used as step-by-step instructions for conducting any of the activities described herein. It is intended to provide an overview of a complete process on the assumption that information will already exist within the organization that would be identified by activities described in this document. Where this is the case, users of this document are encouraged to review the information in their possession to ensure that it includes all of the details required, and that it is up-to-date and accurate.
2 Terms and definitions

2.1
abnormal service
level of service that deviates from the levels agreed for normal operations
NOTE Usually as a result of an incident causing disruption to normal service levels.

2.2
action plan
schedule of activities, lead times and dependencies of activities in order to address a particular requirement

2.3
asynchronous replication
periodic physical replication of data from one storage system to another
NOTE Typically over a wide area network.

2.4
atomic
requirement, transaction or objective which is self contained i.e. cannot be broken down further

2.5
audit log shipping
automated process for transferring records of transactions (audit logs) between primary and secondary systems

2.6
business continuity management plan
document that sets out to ensure resumption of critical business functions in the event of either an incident or unforeseen event that threatens the business

2.7
clustered system
two or more computer systems configured in such a manner that in the event of failure of a system or service run on it, operation is transferred to another system within the cluster

2.8
cold back-up site
site that provides the space but not the infrastructure needed to resume operations quickly

2.9
continuity procedures
set of predefined procedures to be followed in the event of an incident which disrupts normal service levels

2.10
data availability
measure of a system's ability to deliver a predetermined level of data access during a system failure

2.11
dependency modelling
activity used to determine the inter-relationships and dependencies between functions and/or processes and how they affect the system or organization as a whole

2.12
disk imaging
method of copying a complete hard disk of a computer into a single file from which the gathered image can be distributed to a single or multiple computers to minimize the time and effort for the creation of computers that will have identical software and configurations to the original

2.13
domain
logical association of a defined environment and the assets within the pre-defined environment

2.14
downtime vs. cost vs. benefit model
model which analyses the costs of downtime and of the measures required to minimize downtime in the event of an incident and compares them against the benefits available to the organization from services being resumed

2.15
duplexed
ability to simultaneously send and receive data through a medium in both directions
NOTE When used to describe disk devices or disk connectivity it implies duplication or mirroring.

2.16

2.17
fail-over
ability for services offered by a component, server or system to automatically be undertaken by another component, server or system in the event of its failure so that the impact of losing that device, server or system has a minimal impact on the service or services offered

2.18
failure modes and effects analysis (FMEA)
structured quality method to identify and counter weak points in the early conception phase of products and processes1)

2.19
incident
event that disrupts normal IT services
NOTE This usage differs from that in ITIL [1].

2.20
incident recovery
activities required to respond effectively to an incident, with the primary objective being to ensure the resumption of normal service levels

2.21
I/O Processors
allow servers, workstations and storage subsystems to transfer data faster, reduce communication bottlenecks, and improve overall system performance by offloading I/O processing functions from the host CPU2)

2.22
IP Address
logical address of a system within an IP network
NOTE The IP address uniquely identifies computers on a network. An IP address can be private, for use on a Local Area Network (LAN), or public, for use on the Internet or other WAN. IP addresses can be determined statically (assigned to a computer by a system administrator) or dynamically (assigned by another device on the network on demand).

2.23
IT Architecture
overall design of an organization's information technology and services including both physical and logical entities

2.24
IT Infrastructure
physical devices which comprise an organization's information technology and services architecture

2.25
IT Service
set of related information technology and probably non-information technology functionality, which is provided to end-users as a service
NOTE Examples of IT services include messaging, business applications, file and print services, network services, and help desk services3).

2.26
IT Service Continuity Management
supports the overall Business Continuity Management process by ensuring that the required information technology technical and services facilities (including computer systems, networks, applications, telecommunications, technical support and service desk) can be recovered within required, and agreed, business timescales

2.27
last mile telecoms provider
organization responsible for the provision of telecommunications services from the national or local telecommunications infrastructure to a specific location

2.28
latency
delay due to the time it takes to transmit data from one location to another

2.29
maintenance procedures
procedures applied by an organization to ensure that their IT Infrastructure is maintained in optimum condition through both proactive and reactive measures

2.30
Monte Carlo analysis
means of statistical evaluation of mathematical functions using random samples, often used in risk analysis of highly complex systems

2.31
Network Attached Storage (NAS)
storage device that can be attached to the network for the purpose of file sharing
NOTE In essence a NAS device is simply a file server.

2.32
network protocol
technological rules, codes, encryption, data transmission and receiving techniques which allow networks to operate

2.33
operations bridge
central facility used for monitoring and managing systems, services and networks

2.34
paper test
mechanism for proving the hypothetical effectiveness of a process by working through scenarios in a discursive forum

2.35
point in time (PIT)
consistent copy of the data taken at the same instant in time for one or more systems

2.36
recovery procedures
procedures which result in the restoration of services following an incident

2.37
redundant routing
resilient approach to data networking in which there are a minimum of two routes from each node in the network

2.38
rehearsing
the critical testing of ITSC strategies and ITSCs, rehearsing the roles of team members and staff, and testing the recovery or continuity of an organization's systems (e.g. technology, telephony, administration) to demonstrate ITSC competence and capability
NOTE A rehearsal may involve invoking business continuity procedures but is more likely to involve the simulation of a business continuity incident, announced or unannounced, in which participants role-play in order to assess what issues may arise, prior to a real invocation.

2.39
replication appliance
device which provides functionality to replicate data to other storage systems

2.40
risk
combination of the probability of an event and its consequence
[ISO Guide 73:2002]

2.41
risk communication
exchange or sharing of information about risk between the decision-maker and other stakeholders
[ISO Guide 73:2002]

2.42
risk management plan
document that sets out to define a list of activities, lead times and dependencies in order to mitigate one or more identified risks

2.43
risk mitigation
set of actions that will affect either the probability of the risk occurring or its impact should the risk occur
NOTE These are summarized as transferring, tolerating, terminating or treating the risk.

2.44
risk monitoring
iterative process of the risk owner checking and reporting on any changes in status of the risk log in terms of risk proximity, impact and response

2.45
stateful/stateless
describe whether a computer or computer program is designed to note and remember one or more preceding events in a given sequence of interactions with a user, another computer or program, a device, or other outside element
NOTE Stateful means the computer or program keeps track of the state of interaction, usually by setting values in a storage field designated for that purpose. Stateless means there is no record of previous interactions and each interaction request has to be handled based entirely on information that comes with it. Stateful and stateless are derived from the usage of state as a set of conditions at a moment in time. (Computers are inherently stateful in operation, so these terms are used in the context of a particular set of interactions, not of how computers work in general.)

2.46
storage array
two or more hard disk drives working in unison to improve fault tolerance and performance

2.47
synchronous replication
instantaneous physical replication of data from one storage area to another, typically over a high speed interconnect such as fibre channel

2.48
test scripts
definition of the specific tests to be enacted when proving the functionality and operation of a system or service

2.49
vulnerability report
report which identifies the specific vulnerabilities of a specific system or service

2.50
work schedule
defined set of activities and deliverables which, once completed, will result in the desired outcome of a procedure or project

2.51
Zero Data Loss (ZDL)
remote replication method that guarantees not to lose any live data

2.52
zoning
allocation of resources for device load balancing and for selectively allowing access to data only to specific systems
NOTE Zoning allows an administrator to control who can see what is in a SAN.

3 Abbreviations
For the purpose of this PAS, the following abbreviations apply.
BCM     Business Continuity Management
BCMP    Business Continuity Management Plan
BCMT    Business Continuity Management Team
BCSG    Business Continuity Steering Group
CMT     Crisis Management Team
DAS     Direct Attached Storage
IMT     Incident Management Team
I/O     Input/Output
IT      Information Technology (also includes Information Systems (IS))
ITIL    Information Technology Infrastructure Library
ITSC    Information Technology Service Continuity
NAS     Network Attached Storage
OS      Operating System
RAID    Redundant Array of Independent Disks
RPO     Recovery Point Objective
RTO     Recovery Time Objective
SAN     Storage Area Network
UPS     Uninterruptible Power Supply
WAN     Wide Area Network
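As an illustration of Monte Carlo analysis as defined in Clause 2, the following minimal Python sketch estimates yearly downtime by random sampling. It is not part of the guidance in this PAS: the function name, the incident probabilities and the outage durations are all hypothetical assumptions chosen only to show the technique.

```python
import random


def simulate_annual_downtime(incidents, trials=10_000, seed=42):
    """Estimate the distribution of yearly downtime hours by random sampling.

    `incidents` is a list of (annual_probability, min_hours, max_hours)
    tuples. All figures are hypothetical inputs supplied by the caller.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    totals = []
    for _ in range(trials):
        hours = 0.0
        for probability, low, high in incidents:
            if rng.random() < probability:       # does the incident occur this year?
                hours += rng.uniform(low, high)  # how long does the outage last?
        totals.append(hours)
    totals.sort()
    return {
        "mean": sum(totals) / trials,
        "p95": totals[min(trials - 1, int(0.95 * trials))],
        "max": totals[-1],
    }


# Hypothetical incident types: (probability per year, shortest outage, longest outage)
example_incidents = [
    (0.30, 1.0, 8.0),    # failure of a business application
    (0.05, 24.0, 72.0),  # loss of a data centre
]
result = simulate_annual_downtime(example_incidents)
print(result)
```

The simulated mean and upper percentile can then be compared with the downtime the organization regards as acceptable. A real analysis would use distributions derived from the organization's own risk assessment rather than the uniform ranges assumed here.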
The ITSC strategy should enable the organization to plan for and rehearse the whole life cycle of a major incident, from the point of initial disruption, through recovery and abnormal service, to the point where normal service levels are once again guaranteed. The strategy should be developed from a clear understanding of the organization's need for IT services and the agreed service levels that are required from time to time, taking into account:
a) priority for key business units at given moments in time;
b) peak loads on business;
c) strategically important business periods e.g. reporting periods, manufacturing deadlines etc;
d) compliance with Business Continuity Management Plans and objectives;
e) investment vs. risk;
f) impact of failure or loss;
g) recovery time objectives;
h) acceptable levels of downtime and performance;
i) system changes and upgrades;
j) new projects;
k) interdependencies;
l) compliance with legislation;
m) deadline management;
n) rehearsing recovery plans;
o) data protection;
p) data availability;
q) plan maintenance;
r) education and awareness programmes for all IT staff.
The strategy should not define the detailed tactics but should set the direction of the individual components of an ITSC plan.
The six elements in the ITSC Strategy Process are:
a) Understanding the business requirements and agreeing service levels
The ITSC strategy should be aligned with corporate strategy and in line with the pre-defined business goals of the organization. To provide a sound base to start from and allow for continued and controlled growth of the ITSC strategy, a defined service level should be agreed. This will allow for clearly defined service levels which are specific and can be measured, analysed and improved on.
NOTE Refer to ITIL [1] for guidance on agreeing service levels.
b) Reviewing the IT strategic vision and updating objectives
Change is inevitable and allowances should be made to ensure that the ITSC strategy is always aligned with the overall IT strategy and business goals. The overall IT strategy and its ITSC strategy should be constantly assessed and updated.
c) Performing regular risk assessment and dependency modelling (internal and external)
Changes in the nature of the risks and dependencies should be expected. They should be reassessed regularly so that these variations are controlled and taken into account in the ITSC strategy. The quantity and quality of the regular dependency modelling exercises will depend on the nature of the environment.
d) Building and embedding an ITSC culture from new project inception
When initiating an IT project, consideration should be given to how the deliverables will support or enhance the ITSC strategy.
e) Exercising, maintaining and auditing continuity and recovery plans
Through the continued rehearsal, maintenance and auditing of the continuity and recovery plans, an organization can ensure that:
1) the plans are constantly improved on and up to date;
2) staff are familiar with the plans' contents;
3) the plans are fit for purpose and relevant;
4) resource requirements for management of a major incident are understood and planned for;
5) visibility and governance are provided for the teams exercising the plans and the organization as a whole.
f) Monitoring performance
Constant improvements can only be made if the pre-defined service levels in the first stage are constantly monitored, measured, analysed and reported on. Monitoring performance should be central to this.
Everyone within the organization who is likely to participate should be reminded of the importance of initiating an ITSC programme and the priority it should be afforded.
NOTE Experience shows that a better quality of response is achieved if appropriately worded instructions are received from an executive level within the organization. This is particularly true within larger organizations where a variety of individuals and levels of management are likely to be involved.
A suitable mandate should be sent to the relevant department heads and others who may be involved in the programme as a matter of priority before the commencement of the programme.
The business should define the levels of service it expects from the IT department. It should clearly define priorities, allowing the Head of IT to determine which services have greater protection, resilience and redundancy over others.
IT resilience costs money. The depth and detail of an ITSC strategy will for many companies be driven by risk versus cost. A strategy should be developed that delivers improvement over time, focusing on key issues and priorities where risks are high and the impact of loss or failure significant. This in itself is not a simple algorithm and will require an impact and risk analysis before budgets can be assigned and a strategy ratified. An initial strategy should be chosen either by defining a strategy around the priorities and calculating the associated costs for delivery, or by allocating a budget and then building a strategy around the available budget.
NOTE It is rare that an organization will have implemented an ITSC plan from the outset of IT infrastructure implementation. Many IT environments have evolved over time and are often a combination of a number of inconsistent strategies.
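The two ways of choosing an initial strategy described above (priorities first, or budget first) can be sketched as follows. This is a minimal illustration, not a method prescribed by this PAS: the class, the function names and all figures are hypothetical, exposure is assumed to be probability multiplied by impact, and the budget-first approach uses a simple greedy ordering that a real impact and risk analysis would refine.

```python
from dataclasses import dataclass


@dataclass
class ServiceRisk:
    name: str
    annual_probability: float  # assumed likelihood of a disruptive incident in a year
    impact_cost: float         # assumed loss if the incident occurs
    protection_cost: float     # assumed cost of the resilience measure

    @property
    def exposure(self) -> float:
        # Risk taken as probability combined with consequence (here: their product).
        return self.annual_probability * self.impact_cost


def cost_for_priorities(services, priority_names):
    """Priorities first: fix what must be protected, then total the cost."""
    return sum(s.protection_cost for s in services if s.name in priority_names)


def strategy_within_budget(services, budget):
    """Budget first: fund services in descending order of exposure."""
    funded, remaining = [], budget
    for service in sorted(services, key=lambda s: s.exposure, reverse=True):
        if service.protection_cost <= remaining:
            funded.append(service.name)
            remaining -= service.protection_cost
    return funded, remaining


# Hypothetical services and figures, for illustration only.
services = [
    ServiceRisk("email", 0.2, 50_000, 10_000),
    ServiceRisk("order processing", 0.1, 400_000, 30_000),
    ServiceRisk("intranet", 0.3, 5_000, 8_000),
]
priority_cost = cost_for_priorities(services, {"email", "intranet"})
funded, remaining = strategy_within_budget(services, budget=40_000)
print(priority_cost, funded, remaining)
```

Either function gives only a starting point. The text above notes that the choice is not a simple algorithm, so the output should be treated as input to discussion rather than as a ratified strategy.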
expenditure on IT resilience. It should agree on its policy on outsourcing risk management to third party suppliers, e.g. incident recovery companies and third party maintenance organizations (see Clause 11).
Gold: Crisis Management Team
Silver: Business Continuity Management Team
Bronze: Incident Management Team
Depending upon the seriousness of the incident, ITSC should be managed at both Silver (BCMT) and Bronze (IMT) levels, with Bronze teams being established for each IT service, coordinated by a dedicated IT service Silver team. The Gold team (CMT) should provide high level direction, prioritisation and coordination and should take sole and direct responsibility for communicating with external stakeholders, including the media, emergency services and public authorities where that is appropriate or required. Some organizations (mainly large) have internal teams specializing in Corporate Risk and Business Continuity Management from whom support and assistance should be sought when constructing, rehearsing and invoking ITSC Plans. As these teams are most likely to comprise members of the organizations management team, i.e. not business continuity or incident management specialists, it is possible that for both rehearsals and live activations additional specialist support may be appropriate. Consideration should also be given to the resource requirements implied by the activation of these teams and their associated ITSC Plans, since there is a possibility that more resources will be required during the initial stages than are directly available within the organization. 5.5.2 Incident Management Team In the event of disruption, whether from a predicted source or elsewhere, the IMT for the affected IT service should be activated. The IMT should determine whether the nature and extent of the disruption warrant the deployment of the relevant BCMP, and if so should: a) Determine the nature of the disruption and, if necessary, coordinate with the organizations Problem Management4) function to adapt existing procedures for the initial response to the disruption. b) Implement the selected procedures, securing the required resources through the BCMT where appropriate. 
c) Identify, and where appropriate adapt, the relevant continuity procedures to ensure that the business continues to operate as near to normal a manner as possible for the duration of the disruption. All such activities should be coordinated through the BCMT. d) Identify, and where appropriate adapt, the relevant recovery procedures to ensure that the business recovers from disruption in a timely and controlled manner once the root cause of the disruption has been
eliminated. All such activities should be coordinated through the BCMT. e) Activate the BCMT who investigate the requirement for further Bronze teams to be activated. Whilst the IMT is active, all activities should be coordinated through the BCMT to ensure that no action taken by one IMT conflicts with actions taken by others. f) Communicate with all parts of the organization affected by the disruption on a regular basis regarding progress and the actions initiated by the IMT. g) Organize, once recovery actions have been completed, a thorough review of its management of the disruption so all relevant lessons from the experience can be learned and incorporated into procedures and training programmes. 5.5.3 Business Continuity Management Team The BCMT should determine the nature and extent of the disruption and should: a) coordinate the activation and management of all relevant IMTs; b) coordinate the allocation of resources to IMTs; c) coordinate the management of undisrupted business; d) manage communication with regulators, investors, the media, associates and staff; e) ensure that the active IMTs have all the facilities, people and other resources that they require to mount effective response, continuity and recovery operations; f) where appropriate, activate the CMT. 5.5.4 Crisis Management Team The CMT should be activated when an incident, or a combination of incidents, have such wide ranging impact as to require organization-wide response coordination. When active, the CMT should take responsibility for the coordination of all active BCMTs and for all communication with all external stakeholders, especially customers, suppliers, regulators, the media and the public. 
5.5.5 Learning lessons In order to ensure that each real or rehearsed invocation of the ITSC management contributes to the ongoing improvement of the ITSC strategy and related plans, each team should maintain a comprehensive journal, including details of: a) the reasons for the team being activated, including details of the disruption that occurred, and justification for the team being activated; b) any amendments made to the ITSC strategy and related plans and procedures as a result of the actual disruption being different in character to that predicted; c) all decisions made during the disruption, including supporting evidence;
4) http://whatis.com
d) all events transpiring during the disruption, their effects and likely causes; e) all actions taken and evidence of their results; f) all communication in relation to the disruption, including the other parties involved, the nature of the communication and what information was passed in each direction. This journal should cover the period from the time the team is activated to the time it stands down. All entries in the journal should include details of the date and time the entry was made, and by whom. The completed journals should be used to support future reviews of business and IT service continuity plans and their effectiveness. Therefore, stringent change control should be applied on these journals, and no changes of any nature should be permitted once the team has stood down.
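The journal requirements above (timestamped and attributed entries, coverage from activation to stand-down, and no changes once the team has stood down) can be illustrated with a minimal sketch. The class and field names below are hypothetical and are not defined by this PAS; a real implementation would also need durable, access-controlled storage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class JournalEntry:
    recorded_at: datetime  # date and time the entry was made
    author: str            # who made the entry
    kind: str              # e.g. "decision", "event", "action", "communication"
    text: str


class TeamJournal:
    """Append-only journal that is sealed when the team stands down."""

    def __init__(self, team: str, activation_reason: str, author: str):
        self.team = team
        self._entries: list[JournalEntry] = []
        self._sealed = False
        # The first entry records why the team was activated.
        self.record(author, "activation", activation_reason)

    def record(self, author: str, kind: str, text: str) -> JournalEntry:
        if self._sealed:
            raise PermissionError("journal is sealed: the team has stood down")
        entry = JournalEntry(datetime.now(timezone.utc), author, kind, text)
        self._entries.append(entry)
        return entry

    def stand_down(self, author: str) -> None:
        # The stand-down itself is the last permitted entry.
        self.record(author, "stand-down", f"{self.team} stood down")
        self._sealed = True

    @property
    def entries(self) -> tuple[JournalEntry, ...]:
        # Return an immutable copy so callers cannot alter the record.
        return tuple(self._entries)
```

Entries are frozen and the journal only ever appends, so the completed record can be passed to later plan reviews unchanged.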
e) Service levels (e.g. uptime statistics) should be reviewed as a Board agenda item each month. Trend analysis can show even a slight decline in service, which can be an indicator of bigger problems. f) Testing and rehearsing contingency and recovery plans should be an essential ingredient in keeping an ITSC strategy current. Ensuring that a department or application can be recovered fully after failure helps to minimize simple errors and problems. This includes performing complete data back-ups as well as testing third party suppliers. g) Suppliers' ability to maintain appropriate levels of service should be regularly assessed. Including suppliers such as incident recovery and maintenance providers in the change management loop is highly recommended. h) Remunerating staff against service levels can help ensure the relevant level of awareness reaches all levels of the organization. i) An internal/external audit of plans.
Strategic
Tactical
NOTE If a system or service cannot readily be assigned to any of these categories, the organization may wish to consider whether that system or service has any ongoing purpose. If however a system or service can be assigned to more than one category the organization should decide on which single category designation will be used.
The process of assessing business criticality and risk should be managed to ensure that the assessment of physical risks is coordinated with, but not dominated by, the assessment of organizational risks. Neither assessment is more important than the other, but each has its part to play in ensuring that the business as a whole adopts a position in which all types of risk are managed as effectively as possible. The inherent complexity in all organizations implies that any risk assessment method should be adaptable to the different circumstances within each part of the organization. NOTE See Annex A for details on how to conduct business criticality and risk assessments.
staff will understand how to operate the systems on a day-to-day basis, they may not know how to recover the applications and databases in the event of major incident. Make the procedures as comprehensive as possible and maintain them up to date when the systems change. 8.4.3 Initial response to an incident invocation of ITSC procedures If the organization has an IT service desk, the IMT should be contacted when the service desk is first made aware of an incident. If there are many members of the IMT, then a cascade process may be adopted. Either way the person making the initial contact should record who has been contacted and the response they received. If a person on the IMT is unavailable then their responsibilities should be fulfilled by their designated deputy within the team. In any instance, leaving phone messages is not a sufficient response to the incident. NOTE There may be a complex set of instructions or simply instructions to contact the members of the IMT. Depending on the seriousness of the incident the IMT may opt to activate either or both of the BCMT and CMT. If the initial assessment concludes that the IMT can manage the incident directly then the other groups can be stood down. The rules and decision making criteria for activation and escalation will be organization specific and should be developed as part of the ITSC plan. NOTE How the organization defines an incident will impact how you escalate the problem. For example, one site defines a disaster as an event which is likely to render the whole site unavailable for a considerable length of time. A major incident is defined as an event which is likely to render a single or multiple systems as unavailable for a considerable length of time. Any event that is outside of these definitions is handled as a business as usual event that is handled in the normal way by their service desk. 8.4.4 Problem Assessment The IMT has the responsibility of assessing the impact of the incident. 
Where possible, assessment should be made by those with the most domain or system knowledge. Critical time can be lost to prevarication or indecision over whether systems should be failed over. Likewise, critical time can be lost by failing over the system to a remote site too quickly when a simple local recovery or waiting for a system to be repaired would have been sufficient. The IMT should develop a set of detailed criteria based on past experience and escalate based on whether the current situation meets those criteria. Where the IMT identifies a potential impact upon the organization beyond IT services, it is its responsibility to activate the BCMT using the organization's defined processes.
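As an illustration only, the example definitions given in the note to 8.4.3 (disaster, major incident, business as usual) and the escalation rule above can be written as explicit criteria. The four-hour threshold for "a considerable length of time" and all names in this sketch are assumptions; each organization would define its own values in the ITSC plan.

```python
from enum import Enum


class Severity(Enum):
    BUSINESS_AS_USUAL = "business as usual"
    MAJOR_INCIDENT = "major incident"
    DISASTER = "disaster"


# Hypothetical threshold for "a considerable length of time".
CONSIDERABLE_OUTAGE_HOURS = 4.0


def classify(whole_site_down: bool, systems_down: int,
             expected_outage_hours: float) -> Severity:
    """Apply the example definitions of disaster and major incident."""
    considerable = expected_outage_hours >= CONSIDERABLE_OUTAGE_HOURS
    if whole_site_down and considerable:
        return Severity.DISASTER
    if systems_down >= 1 and considerable:
        return Severity.MAJOR_INCIDENT
    return Severity.BUSINESS_AS_USUAL


def teams_to_activate(severity: Severity, impact_beyond_it: bool) -> list[str]:
    """Business-as-usual events stay with the service desk; otherwise the
    IMT is contacted, and the BCMT too if the impact extends beyond IT."""
    if severity is Severity.BUSINESS_AS_USUAL:
        return ["service desk"]
    teams = ["IMT"]
    if impact_beyond_it:
        teams.append("BCMT")
    return teams
```

Writing the criteria down in this form makes them reviewable after each rehearsal or invocation, which supports the aim of avoiding both indecision and premature fail-over.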
8.4.5 Roles and responsibilities The plan should include a full description of the predefined teams: the IMT, and the constituent specialist recovery teams for platforms, services and facilities. These descriptions should also contain each member's role, responsibility and current contact information. NOTE 1 Any documents containing phone numbers should constantly be updated. Electronic documentation that is linked to directory systems is useful for keeping the plan up-to-date provided the directory systems are resilient enough to withstand incidents. NOTE 2 IMT members should be geographically dispersed in order to withstand environmental incidents. 8.4.6 Procedures to follow These procedures should be prepared in readiness for this document and should be frequently updated as a result of rehearsals and actual invocations. If a site has multiple IT systems, there should be multiple procedures which form part of the overall recovery procedures. NOTE Procedures could be developed as hyperlinked electronic documentation connected via a single high level index. This has the advantage over paper that new versions of sub-documents can be released without having to replace all the documents in the set, and everyone has access to the most up-to-date information. All procedure documentation should be readily available to all points of the enterprise, even in an incident scenario. It could be a good idea for IMT members to keep the latest copy of the plan where they will always have access to it, e.g. at home or on their company laptops. The contents of the plan(s) are likely to contain sensitive or confidential information and should always be held securely, with appropriate measures taken to ensure that the contents cannot be accessed by unauthorised personnel (see BS ISO/IEC 17799:2005 for further information on the kinds of measures which can be appropriate).
Following a major incident there may not be the time or equipment necessary to print copies of plans, so any documentation you create should be either easy to read from a screen, or printed out on a regular basis. The procedures may take many forms. One useful form is a flowchart that shows the various possible high-level steps that should be followed and decisions that should be made. A common form of this process flowcharting is known as a swim-lane diagram. Each lane represents an individual recovery process for one system. Figure 5 shows the highest level process flow for a financial institution which runs only three financial systems: savings, mortgages and insurance. How each system is recovered depends on the chosen architecture of each system. In this example, the savings system uses
synchronous remote mirroring of the savings database. Recovery takes the form of enabling the remote mirrors on the remote system, recovering the database environment and then allowing branch traffic to access the system from the remote site. The mortgage system uses a combination of tape back-ups and audit log shipping. To recover this environment, first reload the last known copy of the database from tape and then bring it up to date by reapplying the audit records read from the
audit logs. The insurance system is a high availability clustered system which automatically fails-over to the back-up site to provide almost uninterrupted service. NOTE In this example there are no interdependencies between the individual systems. This may not be the case in reality. Quite often one system needs to be recovered before another can be brought on line.
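The mortgage system recovery described above (reload the last known copy from tape, then reapply the audit records) can be sketched as follows. The record layout, with a sequence number, a key and a new value, is a hypothetical simplification; real audit logs and restore tooling are product specific.

```python
def recover_from_backup(
    backup: dict[str, int],
    backup_seq: int,
    audit_log: list[tuple[int, str, int]],
) -> dict[str, int]:
    """Reload the last backup, then reapply audit records made after it.

    backup      -- database contents as held on the last tape back-up
    backup_seq  -- sequence number of the last change included in the back-up
    audit_log   -- shipped audit records as (sequence, key, new_value)
    """
    database = dict(backup)  # step 1: reload the last known copy
    for seq, key, new_value in sorted(audit_log):
        if seq > backup_seq:  # skip changes the back-up already contains
            database[key] = new_value  # step 2: roll forward in order
    return database
```

Applying the records in sequence order matters: replaying them out of order could leave an older value in place of a newer one.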
Figure 5 Example of a high level process flow chart for service continuity management
Disaster / Major Component Event
Assess scale of disaster Switch all branch networks to remote site Re-route help desk and operations calls to remote site
Each process in this flow chart should be documented separately, with its own flowchart if necessary highlighting each task that forms the process. The documented procedures should provide detailed step-by-step instructions. The level of detail required in the plan will
depend on the skill level of the intended audience. Each task shown in the top level process flow chart should be accompanied by a summary sheet containing the items shown in Figure 6.
Call-in the fail-over operations team
Contact remote site on-call operations staff and request extra coverage at the remote site.
Essential documentation: Current remote site Operations Contact List (Contact-List.doc); Emergency Call Out Procedure (Emergency-Call-Out-.doc)
Action takes place at: Wolverhampton Back-up Site
Task completed by: Remote Site Operations Support Manager
Preceding tasks: A-3
Time to complete task: 10 minutes
Requestor: Incident Management Team (BCM Manager)
Full description/reason for action: There is a need to provide full operations coverage at the remote site to augment normal skeleton staff. Thus need to invoke emergency on-call procedures for operations.
Status check: Ensure that the section below is completed and signed-off
Status and Comments:
Name:
Time:
Signature:
8.4.7 Fail-back Although it may not be possible to plan for all post failover scenarios, where for example there has been total devastation of the production site, basic planning should be undertaken and the high level steps understood. When returning service to the original system or site then detailed plans should be created for the fail-back process. In these circumstances it is unlikely that fail-back will be a
straightforward reversal of the fail-over steps and a separate set of procedures are likely to be required. Thus a full fail-back plan should be in place with the same quality and standard of documentation as for the fail-over. Figure 7 shows an example fail-back plan for the fictitious fail-over considered in Clause 8.4.6.
d) How can staff be trained to cope with the situation if they do not experience it in rehearsal-mode? e) Once the BCMP is in operation, how will you return to normal business operations? Are there specific issues here that warrant rehearsing in their own right? f) How different are the circumstances of an actual invocation likely to be relative to those of a rehearsal? NOTE For example it may be advisable to use copies of live systems and data in a rehearsal, the emotional environment of a rehearsal is likely to be more relaxed than in a real incident, etc.
9.5 Strategy
To achieve the organization's ITSC objectives, a combination of the following recommendations should be considered. The frequency of exercises will depend on the individual circumstances of your organization but accepted best practice is to exercise plans at least once a year.
a) Callout rehearsals should be conducted regularly; in addition, a surprise callout rehearsal should be conducted involving all departments and the IMT.
b) Walk through reviews of recovery plans, emergency management plans and departmental plans.
c) Scenario-based walkthrough exercises for IMT, support teams and individual departments.
d) Component rehearsing (e.g. individual departments, business processes, IT systems, voice and data network links, etc.), for instance when new systems are implemented, when there are previous rehearsal failures, when changes occur or for previously unrehearsed components. Component testing should also be considered during periods when a more comprehensive test cannot be completed, e.g. test that network traffic can be redirected to the fail-over site, that users can connect to the fail-over site and that live data can be restored at the fail-over site.
e) Integration rehearsals (e.g. multiple systems and/or business processes): where IT services rely upon combinations of information systems working together, the organization should reassure itself that it is capable not only of recovering the individual systems but also of recovering them in such a way as to provide the required services by interacting as expected.
f) Relocation rehearsals (technical and business recovery), whereby key parts of the business relocate to, and operate from, the recovery site, including the loss of the main facility, an IT switch or critical business processes.
g) Fail-over rehearsals of the live IT environment to the recovery site (including verification by users) and business relocation rehearsals.
h) Major incident simulations should include scenario-based role playing exercises, IT fail-over, business relocation and full fail-back rehearsals.
In all cases, results should be documented and updates to appropriate continuity plans completed within four weeks of each rehearsal. All rehearsing should be carefully managed and coordinated to ensure low risk to the business but with maximum return on the effort put in.
The suggested roles are as follows: a) Business Continuity Coordinator: the key facilitator of the Business Continuity function. b) Compliance/Audit: to oversee recovery rehearsals and exercises and to ensure they meet the regulatory requirements and satisfy external auditors. c) Business Continuity Steering Group (BCSG): oversight committee for the entirety of the business continuity function consisting of senior representation from all business areas, to reflect the business-wide impact of business continuity planning and management. NOTE As part of the rehearsal strategy, the organization's Business Continuity function should maintain a rolling rehearsal schedule. The Business Continuity Steering Group should sign off the rehearsal programme as part of this document being issued. d) IT Rehearsal Working Group: responsible for planning technical IT aspects of recovery rehearsals.
e) Business Continuity Rehearsal Group: chaired by the Business Continuity Coordinator and including representatives from the IT Support Groups and Compliance/Audit. The Business Continuity Rehearsal Group reports to the BCSG. The Business Continuity Rehearsal Group is responsible for: 1) planning and executing all ad hoc infrastructure rehearsing, and regular full-scale service continuity simulation rehearsals; 2) agreeing the rehearsal scope and objectives with the business, via the BCSG; 3) pre-rehearsal planning and preparation; 4) production of the rehearsal plan document; 5) coordination of activities during the rehearsal; 6) post rehearsal reporting; 7) follow-up of actions arising.
The Business Continuity Rehearsal Group should be business-led, rather than an IT-led group. The Business Continuity Rehearsal Group should meet regularly, as required to meet the above responsibilities. Typically, this will be monthly, but increasing in frequency in the weeks before a rehearsal.
9.7.3 The importance of rehearsing Rehearsing is a vital part of the long term BCM lifecycle, which will prove the viability of recovery plans and highlight areas for further improvement. It also provides an ideal training opportunity for those involved in the key activities. Rehearsals are so called so that areas of weakness can be identified and new processes implemented to improve resilience. It is crucial that rehearsals are seen as positive tasks and any internal political influences are eliminated so that the focus of business resilience and continuity is maintained. The overall aims of the rehearsing strategy are to ensure effective crisis management and to enable live processing to be moved to the recovery site(s) on a regular basis and become part of business as usual. NOTE Even the most comprehensive rehearsal does not cover everything. For example in a service disruption where there has been injury or even death to colleagues, the reaction of staff to a crisis cannot be rehearsed and the plans should make allowance for this. Rehearsals should have clearly defined objectives and critical success factors which will be used to determine the success or otherwise of the exercise as well as of the BCP itself. A full rehearsal should replicate the invocation of all standby arrangements, including the recovery of business processes and the involvement of external parties. This should test completeness of the plans and confirm: a) time objectives, e.g. to recover the key business processes within a certain time period; b) staff preparedness and awareness; c) staff duplication and potential over commitment of key resources, during invocation of the BCP; d) the responsiveness, effectiveness and awareness of external parties. Rehearsals may be announced or unannounced. However, in the latter case the senior management should approve the announcement in advance otherwise it may be difficult to achieve commitment. 
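One of the items a full rehearsal should confirm is the time objectives. A minimal sketch of that check is shown below; the process names and the choice to treat an unmeasured process as a miss are assumptions made for illustration.

```python
from datetime import timedelta


def missed_time_objectives(
    objectives: dict[str, timedelta],
    measured: dict[str, timedelta],
) -> list[str]:
    """Return the processes whose recovery exceeded the time objective.

    A process with no measured recovery time is reported as missed,
    because the rehearsal did not demonstrate that the objective is met.
    """
    missed = []
    for process, target in objectives.items():
        actual = measured.get(process)
        if actual is None or actual > target:
            missed.append(process)
    return missed
```

The resulting list can feed directly into post-rehearsal reporting and the follow-up of actions arising.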
9.7.4 Rehearsal objectives The rehearsal strategy should meet the objectives to: a) validate emergency callout procedures and contact details contained in the recovery plans; b) ensure key staff are familiar with their Incident Management, Business Recovery and Technical Recovery plans;
c) prove the ability to recover the technical IT and communications infrastructure; d) prove the ability of critical staff to relocate to and work from the nominated recovery site(s); e) validate the effectiveness and accuracy of the documented IT and Business Recovery plans. 9.7.5 Planning a rehearsal All parts of each rehearsal should be planned in advance, as without planning and preparation the following could occur: a) objectives will not be met and live systems could be adversely affected; b) the rehearsal could fail, which will cause the staff involved to disassociate themselves from Business Continuity and Service continuity rehearsal; c) the identified resources (staff and other) may not be available when required or may not be appropriate, for example in terms of skill sets, communications links and server specification; d) there is nothing to measure progress against and therefore no opportunities to improve the rehearsing process; e) expectations of the organization's staff and customers may not be met or remain unknown. NOTE In many ways each rehearsal can be viewed as a project in that it has defined start and end points and should have agreed objectives and desired outcomes. For guidance on best practice in project management and planning the reader should refer to PRINCE2 [2] and/or the Project Management Institute's Project Management Body of Knowledge [3].
[Figure. Layers: Site, Application, Data, Platform. Measures: site recovery; site/data centre failover; application failover/load balancing; redundant systems; SAN, NAS & DAS; backup and restore.]
If the IT architecture is changed to support ITSC then this should be checked to ensure it does not compromise continuity or security. Thus a review of the complete environment should be undertaken to ensure security is maintained at the same level. This should include a thorough examination of alternative/back-up sites and network links between them. The following should be considered: a) Is the replication of data exposing client data? b) Are the Service continuity rehearsal plans secure or could these be used to identify weaknesses in the IT architecture? c) Are there unused service continuity rehearsal Internet Protocol (IP) addresses, which during normal operation a hacker could use to gain access to the network? The classic approach to ITSC is to use a two-site model which has a back-up site that can continue to provide a service when the main site is disabled or destroyed by an incident. There are a number of ways in which this remote
site model may be implemented (see Annex D), depending upon the organization's requirements.
multiple independent systems with automated fail-over between hosts. There could be issues relating to the sharing of databases (see D.2). Clustering and database sharing should be implemented if there are concerns around hardware or even software stability. Any application resiliency mechanisms should ensure recovery of data to consistent points. For example, if a database has data on one volume and the indices on another, then the application should ensure that updates to the disks are either all applied or none applied, i.e. the update is atomic. Databases that are resilient in this way are said to adopt Atomic, Consistent, Isolated and Durable (ACID) properties. A stateless server is one that provides a service but retains no transaction state information between interactions from the client. Each transaction is atomic, i.e. self-contained, and has no relation to preceding or following interactions. An example of this type of server is a web server; web applications are typically stateless. Naturally stateless servers are good candidates for the creation of server farms: large groups of servers that all offer the same level of service. When optimum load is exceeded then another server running the same stateless server software should be added.
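The "all applied or none applied" behaviour described above can be demonstrated with a small sketch using Python's standard sqlite3 module. Two tables stand in for the data volume and the index volume; the table names and the simulated failure are assumptions made purely for illustration.

```python
import sqlite3


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with a data table and an index table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE name_index (name TEXT, customer_id INTEGER)")
    return conn


def add_customer(conn: sqlite3.Connection, customer_id: int, name: str,
                 fail_before_index: bool = False) -> bool:
    """Insert the data row and its index row as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO customers (id, name) VALUES (?, ?)",
                (customer_id, name),
            )
            if fail_before_index:
                # Simulate a failure between the two related updates.
                raise RuntimeError("simulated failure between updates")
            conn.execute(
                "INSERT INTO name_index (name, customer_id) VALUES (?, ?)",
                (name, customer_id),
            )
        return True
    except RuntimeError:
        return False


def row_counts(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (rows in customers, rows in name_index)."""
    data = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    index = conn.execute("SELECT COUNT(*) FROM name_index").fetchone()[0]
    return data, index
```

After a simulated failure the data table and the index table still agree, because the partial insert is rolled back rather than left half applied.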
5) http://www.cips.org
the same resource with another client of the supplier in the same building, street or close area.
Another consideration is that of the physical and environmental security measures which are in force in the recovery site. These should be equivalent to those for the primary location and should be regularly audited against specific and detailed requirements.
11.7 Rehearsing
An ITSC plan should always be rehearsed to ensure it is current and appropriate to meet the required ITSC service levels. It is crucial when procuring continuity services from a supplier that the services are rehearsed. A supplier will have a finite amount of resource (both equipment and people). It is important when buying a service that the supplier's resources are known to the buyer to help it gauge the chances of service provision when an incident or interruption occurs. This information should be made readily available by the supplier; however, one way to gauge the amount of available resource is to request scheduled and unscheduled rehearsals. If a supplier is under-resourced to meet its contractual obligations, it is unlikely to be able to honour short timescale scheduled rehearsals. Should this happen, then alarm bells should be ringing as a lack of resource in a rehearsal when all is relatively calm and quiet, is likely to mean over-stretched resources and over-syndicated services. If the doubt is there, the buyer should ask deeper questions to ensure its own risk management levels have not been compromised.
Response Selection
Response Planning
NOTE Where systems and/or IT services are involved in safety critical environments, such as on oil rigs, nuclear power plants etc., more sophisticated approaches to risk management such as Monte Carlo Analysis may be more appropriate.
temporary failure of an information flow from another business process; c) plant or equipment; d) buildings and environment; e) information technology or systems; f) information security including confidentiality, integrity and availability; g) projects including risks associated with not delivering the specified solution, risks associated with the solution and risks associated with its delivery. In assessing the types of risk to which a physical or organizational component of the business could be subject, the assessment should be well informed and based on verifiable evidence. Where possible and appropriate, the views of acknowledged experts should be called upon to ensure that the assessment of the nature and likelihood of a particular risk is as realistic as possible. All risks identified during this activity should be described in the ITSC plan. At this stage it is only necessary to record summary details for each risk including a name, which should convey something of the nature of the risk, and one or two sentence description of the nature of the risk. The probability of a risk occurring and its likelihood should be determined according to Table A.1.
In performing a risk assessment one should identify not only the immediate effects of the risk occurring but also the impact on the business of those effects. For example, the effect of a hard disk problem could be the corruption of some data stored on that disk, whilst the business impact of corrupt data relating to customer accounts could be significant cash flow problems and damage to the organization's reputation for excellence. In general, the assessment of each risk should consider the impact on: a) environment; b) financial performance of the organization; c) health and safety of employees and the public; d) morale of employees; e) productivity and process efficiency;
f) product quality; g) business controls; h) regulatory or legislative compliance; i) reputation of the organization with its customers, investors, staff and suppliers; j) political impact at local, regional, national and international level. When assessing the impact of a risk one should ensure that the assessment is well informed and based upon verifiable evidence, hence, expert opinion should be called upon where possible and appropriate to do so. Table A.2 categorizes the impact of a risk.
Is the risk sufficiently likely or its impact sufficiently significant to justify implementing the response? Would a decision not to develop a response leave the organization or its officers open to civil or criminal litigation? Would the benefits (both in terms of risk mitigation and other consequential improvements) to the business from implementing the response outweigh both the costs of taking no action and the costs associated with the implementation?
If not, consideration should be given to alternative approaches which cost less to implement or in some cases whether the organization is prepared to accept the risk.
In order to ensure that Risk Management represents a viable and positive investment for the future of the business, a cost-benefit analysis for each possible risk response should be conducted. The objective of this exercise is to determine whether the benefits of taking action will outweigh the costs of taking no action. This analysis is then fed into the decision making process for selecting the responses to be implemented. From the risk profiles (see A.2), documented in the ITSC plan, obtain details of the financial costs of: a) the estimated cost of taking no action in the event that the risk occurs, i.e. the impact cost; b) the estimated development and implementation costs of existing and new counter-measures; c) the estimated costs that would be prevented or averted by implementing the proposed counter measures. In addition to these financial costs, other factors should be
taken into account, such as the organization's reputation, employee health, safety and morale, environmental protection, security and the confidence of investors, customers and regulators. In each case, an estimate of the impact on the intangible factors should be made for taking no action, for preventing the risk and for implementing the proposed counter-measures. By examining the intangible costs in conjunction with the financial costs a broader picture is seen. This can be fed into the process of deciding whether a response should be implemented for the risk(s) in question.
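The financial part of the cost-benefit analysis can be sketched as simple arithmetic on the three cost items listed above. The field names and figures are hypothetical, the calculation does not weight costs by likelihood, and the intangible factors still need separate judgement.

```python
from dataclasses import dataclass


@dataclass
class ResponseOption:
    name: str
    impact_cost: float          # (a) cost of taking no action if the risk occurs
    countermeasure_cost: float  # (b) cost to develop and implement
    averted_cost: float         # (c) cost prevented by the counter-measures

    @property
    def net_benefit(self) -> float:
        """Cost averted minus the cost of the counter-measures."""
        return self.averted_cost - self.countermeasure_cost

    @property
    def residual_cost(self) -> float:
        """Impact still expected after the counter-measures are in place."""
        return self.impact_cost - self.averted_cost

    def financially_justified(self) -> bool:
        return self.net_benefit > 0


def rank_options(options: list[ResponseOption]) -> list[ResponseOption]:
    """Order candidate responses by net benefit, best first."""
    return sorted(options, key=lambda o: o.net_benefit, reverse=True)
```

A response with a negative net benefit is not automatically rejected: the intangible factors described above may still justify it.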
High
Category One
Category Two
Category Three
Having assigned the risk category, details of the likelihood, impact and risk category should be added to the risk description in the ITSC plan. At this stage the ITSC plan should contain the risks in category grouping, with Category One risks listed first. To interpret these categories further: a) A Category One risk is one to which the organization should certainly respond; b) A Category Two risk is one to which the organization should consider responding; c) A Category Three risk is one which the organization should consider accepting. No two organizations are the same and thus no firm guidance on interpreting these categories can be given without it being inappropriate to a significant percentage of the audience. Hence, though the guidance above is intentionally vague, it helps to frame the questions the organization should be asking itself at this stage of the process. A.4.3 Develop risk profile For each risk identified as falling into Categories One and Two, a risk profile should be developed, which defines: a) the nature of the risk and the events likely to trigger it; b) the probability of the risk occurring, including details of any circumstances where the likelihood of the risk could change; c) details of the potential impact of the risk on the business, including estimates of the cost to the business of taking no action to prevent or mitigate its impact; d) details of the symptoms likely to be displayed in the
event that the risk occurs and the ways in which these symptoms could be detected; e) an assessment of the likelihood of detecting the risk and measures that could be taken to increase that probability; f) details of existing counter-measures designed to monitor the risk, prevent it from occurring or to mitigate its impact, including estimates of the costs of implementing and maintaining these counter-measures; g) proposals for additional counter-measures, or changes to those in place, to prevent the risk from occurring and to mitigate its impact, including details of the facilities, equipment and personnel required, and estimates of the time, effort and cost required to implement and maintain these new counter-measures; h) estimated savings accruing from implementing the proposed counter-measures in the event that the risk occurs; i) estimated consequential savings likely to accrue from implementing the proposed counter-measures in the event that the risk does not occur. This information provides the basis for a cost-benefit analysis, which should support decision making on how each risk should be addressed by risk monitoring, risk mitigation, risk communication and business continuity planning activities. Details of the risk profile are added to the ITSC plan. A.4.4 Assess probability of detection The probability of symptoms of the risk being detected should be determined according to Table A.4.
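The handling of risk categories described in A.4.2 and A.4.3 (Category One risks listed first, risk profiles developed for Categories One and Two) can be illustrated with a short sketch. The class and function names are hypothetical, and the assignment of a category to each risk is taken as given rather than derived here.

```python
from dataclasses import dataclass

# Interpretation of each category, paraphrasing the guidance in this annex.
GUIDANCE = {
    1: "the organization should certainly respond",
    2: "the organization should consider responding",
    3: "the organization should consider accepting the risk",
}


@dataclass
class Risk:
    name: str
    category: int  # 1, 2 or 3


def order_register(risks: list[Risk]) -> list[Risk]:
    """List risks in category grouping, Category One first."""
    return sorted(risks, key=lambda r: r.category)


def needs_profile(risk: Risk) -> bool:
    """A risk profile is developed for Category One and Two risks."""
    return risk.category in (1, 2)
```

Because the sort is stable, risks within the same category keep the order in which they were recorded.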
A.4.5 Response selection
A basic model for determining appropriate responses is based upon risk categorization and likelihood of detection. Having categorized the identified risks and having decided whether a response ought to be implemented, the nature of that response should be influenced not only by the potential impact or likelihood of the risk occurring but also by the organization's ability to detect that it has occurred. For example, in planning a response to a risk such as the example given for a low probability of detection, the organization might be well advised to consider implementing specialized monitoring processes and/or equipment to make detection of the risk more likely. In the case of the medium probability example, the organization can implement one of a number of common firewall and intrusion detection tools to both identify and prevent such intrusions.
A.4.6 Assign responsibility and implement
Having determined the appropriate response to the risk, the actions implied should be planned such that resource utilization and cost information is available for cost-benefit analysis. The cost-benefit analysis is an important part of the decision making process for determining which of the potential response actions will be justified and therefore implemented. It is also important information to retain when a decision is taken not to take action in response to a risk, as it demonstrates that a formal and rigorous thought process was followed in arriving at that decision. Details of the decisions taken on the proposed response actions should be added to the ITSC plan and summarized in an action plan and work schedule.
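The response selection model in A.4.5 can be sketched as a simple lookup on risk category and probability of detection. The function name, the three detection levels and the wording of each response are assumptions for illustration only; each organization would substitute its own rules.

```python
DETECTION_LEVELS = ("low", "medium", "high")  # assumed levels for this sketch


def select_response(category: int, detection: str) -> str:
    """Return an illustrative response; the mapping itself is hypothetical."""
    if category not in (1, 2, 3):
        raise ValueError(f"unknown risk category: {category}")
    if detection not in DETECTION_LEVELS:
        raise ValueError(f"unknown detection level: {detection}")

    # Category Three risks are candidates for acceptance regardless of detection.
    if category == 3:
        return "consider accepting the risk"

    stance = "respond" if category == 1 else "consider responding"
    if detection == "low":
        # Poor detectability: improve monitoring as part of the response.
        return f"{stance}; add specialized monitoring to improve detection"
    return f"{stance}; rely on existing detection and plan counter-measures"


print(select_response(1, "low"))
print(select_response(2, "medium"))
print(select_response(3, "high"))
```

Keeping the rules in one function makes the reasoning behind each decision easy to record in the ITSC plan, including decisions not to act.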
j) Redundant routing of communications: The ability to communicate in a period of disruption is fundamental to the successful management of an incident. Whilst there may be multiple redundant phone lines into and out of sites, check that the telephony provider is not routing all these lines through one common exchange, which could be affected by an incident at that exchange. In addition, since email systems can be affected by an incident, it may be prudent to maintain a number of independent email accounts with external Internet Service Providers (ISPs) for use in case of emergency. Consideration should be given to providing multiple forms of communication, such as SMS, pagers, external (non-corporate) email systems, pre-agreed brief coded messages (to avoid overloading the networks and to speed communications) and so on.
k) Third party connectivity and external links: If the organization depends on the services of a third party provider (for example, in the financial world many companies use third party credit reference agencies), those services should be accessible from the remote site. The contract with the third party should provide a guaranteed level of service in the event of an incident.
D.2 Active/Contingency
This model introduces a remote or back-up site for recovery only at the time of incident. It is often referred to as a cold back-up site since at the point of incident it usually consists of either an empty computer room, or a computer room populated with inactive computers in an un-initialized state. An alternative to this static computer room is a mobile computer suite provided with generators
D.3 Active/Active
At the other end of the spectrum from the Active/Contingency model is the Active/Active model. As the name implies, in normal operation both sites are up and running, accepting work and balancing the load across all computers at both sites. In the event of an incident or system failure at one site, all work is routed to the second site, which has been sized to accept the workload increase with little or no reduction in throughput. The advantages and disadvantages associated with this model are listed in Table D.2.
D.5 Active/Back-up
In the Active/Back-up model two separate computer suites are maintained, but production runs at only one site; the back-up systems hosted at the remote site are enabled only when an incident strikes. One way of exploiting the software licence issue is to utilize the back-up systems as development, test or training platforms. Many IT companies will reduce the cost of software licences if a system is only used for development work, and will allow production licences to be transferred to the back-up site when an incident strikes, although this can incur additional cost. The advantages and disadvantages associated with this model are listed in Table D.4.
Figure E.1 Downtime vs. cost
[Figure not reproduced. Axes: downtime and cost against system availability (99.0%, 99.9%, 99.99%, 99.999%). Cost labels: $$$ Fault Tolerant, $$$$ Continuous Processing.]
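To relate the availability levels in Figure E.1 to actual downtime, the short sketch below converts each percentage into hours of downtime per year. It assumes a 365-day year of continuous service; the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a system is unavailable at the given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for level in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(level)
    print(f"{level}% availability: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each additional "nine" reduces the permitted downtime by a factor of ten, from roughly 87.6 hours per year at 99.0% to a little over five minutes at 99.999%, which is why the cost rises so steeply towards the right of the figure.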
Availability spans many discrete layers both within and outside the infrastructure, with each layer providing additional levels of fault tolerance and/or recovery. The first layer is the physical building and security level that prevents unauthorised access to the proximity of the site. Each layer also includes a subset of systems such as electrical, mechanical, cooling, and security that can be used independently or combined to provide improved levels of availability.
generated by computer and communications equipment, lights, and people.
NOTE For the computer and communications equipment it is best to use the peak load figure based on the maximum power requirements of the device. This is normally shown on the device's power requirements label or in the system specification document.
The size of the data centre, and the positioning of the systems within it, should also be taken into account as both will affect the cooling requirements.
E.2.3 Systems monitoring
At the platform level all equipment should be fully monitored (preferably in real time) to give warning of any possible failures. All the large computer and communications vendors provide software utilities to monitor their systems. For ease of administration in a multi-vendor environment it can also be beneficial to deploy an enterprise class systems management suite, which allows all vendors' systems to be monitored from a single console. Some vendors also provide a feature called "phone home", where a system sends an alert via email or a similar transport mechanism to the vendor's support personnel, alerting them to a failing or failed component. The vendor can then dispatch an engineer to resolve the problem.
E.2.4 Warranty and support
All data centre equipment should have an appropriate level of warranty for maintenance and troubleshooting support. Most system vendors provide a tiered warranty structure ranging from next business day to 24x7x365 with a 2 hour fix. The level of warranty purchased for a system is largely dependent on how mission critical the system is to the business.
RAID levels that utilize the most disks generally provide the highest level of redundancy and performance. Each RAID level has its own advantages and disadvantages, which are summarized in Table E.1.
Table E.1 RAID levels
RAID 1: 100% disk redundancy; improved read performance over RAID 0; simple design.
RAID 5: Data and parity striped across multiple disks; requires a minimum of three disks; very good read performance; parity is distributed across all disks; maximum utilization of disk resources.
RAID 0+1 and RAID 10: Very good read/write performance; same overhead as RAID 1; can withstand single drive failures across RAID 1 segments.
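The trade-off between redundancy and disk utilization summarized above can be illustrated by calculating usable capacity for each RAID level. This is a minimal sketch that assumes equal-sized disks, a RAID 1 set in which every disk holds the same data, and a RAID 10 set built from two-disk mirrors; the function name is hypothetical and real controllers may reserve additional space.

```python
def usable_capacity_gb(level: str, disks: int, disk_size_gb: float) -> float:
    """Usable capacity for a RAID set of equal-sized disks (simplified model)."""
    if level == "1":
        if disks < 2:
            raise ValueError("RAID 1 needs at least two disks")
        # Every disk mirrors the same data, so capacity is that of one disk.
        return disk_size_gb
    if level == "5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        # The equivalent of one disk is consumed by distributed parity.
        return (disks - 1) * disk_size_gb
    if level == "10":
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID 10 needs an even number of disks, at least four")
        # Striping across two-disk mirrors: half the raw capacity is usable.
        return (disks // 2) * disk_size_gb
    raise ValueError(f"unsupported RAID level: {level}")


for level, disks in (("1", 2), ("5", 4), ("10", 4)):
    print(f"RAID {level} with {disks} x 500 GB: {usable_capacity_gb(level, disks, 500.0):.0f} GB usable")
```

With four disks, RAID 5 yields three disks' worth of usable space while RAID 10 yields two, which reflects the "maximum utilization of disk resources" and "same overhead as RAID 1" entries in the table.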
E.3.2 Direct Attached Storage (DAS)
This is one of the most frequently used storage methods for both internal and external storage. It simply consists of directly attaching the storage device or disks to a computer system using a RAID controller. For internal storage the RAID controller can either be a Peripheral Component Interconnect/SUN Bus (PCI/SBUS) or integrated device. For external storage a number of options are available depending on the intelligence of the storage device. If the external device is just a bunch of disks (JBOD) these will need to be connected to an internal RAID controller. If the external storage device includes disks and a disk controller then an appropriate host bus adapter will need to be installed in the server system.
NOTE For instance the external device might be a fibre channel array, in which case a fibre channel host bus adapter will be installed in the server.
The main disadvantage with DAS is that it creates islands of isolated storage that can only be accessed by the locally attached server. Each pool of storage also needs to be managed separately. Because of its simplicity DAS usually has more single points of failure than other storage models.
E.3.3 Network Attached Storage (NAS)
Most NAS devices use an underlying proprietary Operating System (OS) and file system so that they can be used in a heterogeneous environment. A NAS device will typically present both Network File System (NFS) and Common Internet File System (CIFS) file systems to end users, but internally these file systems are usually stored as a separate file system. The majority of NAS devices operate at the file level, although some of the newer models which offer features like mirroring and snapshot technology operate at the block level. NAS devices have become very popular due to the ease with which they can be deployed and centrally managed. Gigabit networking has also helped to expand their usability in the data centre.
Most NAS devices also incorporate high availability features at the platform level to protect against disk, power and controller failure. Some of the higher end models support clustering of the NAS device to protect against unit failure. The biggest disadvantage with NAS devices is their potential to saturate the local network during peak usage. This can limit their use for I/O intensive applications. Recent improvements in Local Area Network (LAN) connectivity speeds have helped to offset this limitation.
E.3.4 Storage Area Networks (SAN)
SAN storage has become increasingly popular over the last few years. Its centralized design helps storage administrators easily deploy and manage storage resources. The SAN fabric is responsible for carrying data between the host servers and the target storage arrays. The fabric is a dedicated fibre based network designed for high availability through the use of multiple data paths between the hosts and storage array. The SAN array (also known as the storage array) is the subsystem which houses the power supplies, fans, disks, disk controllers and the array's operating system. SAN arrays are designed for the high availability of mission critical data. All SAN arrays operate at the block level and are independent of the file systems they host. This makes them well suited to heterogeneous environments.
In addition to the platform redundancy built into the SAN array there are also a number of other features inherent to SAN storage. These features include Snapshot, Mirroring and Cloning functionality. Snapshot technology allows point-in-time copies of data to be created, mainly for the purpose of backup. The Cloning feature also creates point-in-time copies of data. Unlike Snapshot technology, which uses disk pointers to create an instant point-in-time copy of the data, cloning physically copies every block of data to a new disk. Cloning takes longer than a snapshot but it provides better availability by creating a duplicate of the source disk. Both Snapshot and Cloning store their information in the local array.
For increased availability the data can be replicated to a remote array using mirroring technology. There are two main types of mirroring, synchronous and asynchronous. Synchronous mirroring copies each block of data to the remote array and waits for an acknowledgement before writing the block of data to the local array. This ensures that both the remote and local copies of data are always consistent. Implementing synchronous replication between two sites will typically require a high speed link such as dark fibre Dense Wavelength Division Multiplexing (DWDM).
Synchronous replication is also limited to network path distances of 200 km. Replicating data over extended distances requires asynchronous mirroring. This method allows writes to the local array to continue as normal and then be replicated to the remote array at fixed intervals. The one disadvantage with asynchronous replication is the possibility of losing data that has not been replicated to the remote site should the local site fail.
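The difference between synchronous and asynchronous mirroring described above can be illustrated with a toy model. This is a sketch only: the class and method names are hypothetical, the arrays are modelled as in-memory dictionaries, and network acknowledgements and failures are not simulated.

```python
class MirroredVolume:
    """Toy model of a local array mirrored to a remote array."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.local = {}    # block id -> data held on the local array
        self.remote = {}   # block id -> data held on the remote array
        self.pending = []  # writes awaiting replication (asynchronous mode only)

    def write(self, block_id: int, data: bytes) -> None:
        if self.synchronous:
            # The remote copy is written (and assumed acknowledged) first,
            # then the local copy, so the two are always consistent.
            self.remote[block_id] = data
            self.local[block_id] = data
        else:
            # The local write completes immediately; replication happens later.
            self.local[block_id] = data
            self.pending.append((block_id, data))

    def replicate(self) -> None:
        """Flush queued writes to the remote array (run at fixed intervals)."""
        for block_id, data in self.pending:
            self.remote[block_id] = data
        self.pending.clear()

    def blocks_lost_if_local_fails(self) -> list:
        """Blocks whose latest contents exist only on the local array."""
        return [b for b, data in self.local.items() if self.remote.get(b) != data]


async_volume = MirroredVolume(synchronous=False)
async_volume.write(1, b"orders")
async_volume.replicate()
async_volume.write(2, b"payments")   # not yet replicated
print(async_volume.blocks_lost_if_local_fails())

sync_volume = MirroredVolume(synchronous=True)
sync_volume.write(1, b"orders")
print(sync_volume.blocks_lost_if_local_fails())
```

In the asynchronous case the exposure is bounded by the replication interval: the shorter the interval, the less data is at risk, at the cost of more traffic on the inter-site link.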
example a large web farm with fifty web servers. In order to balance web requests across all fifty servers each server has load balancing services installed along with a virtual IP address. All fifty servers get the same virtual IP address, so each time a web request is received all fifty web servers intercept the request, but the load balancing software uses a set of rules to determine which server should process the request.
b) Fail-over cluster services can be classified as either shared everything or shared nothing. Like load balancing cluster services they can also be installed as part of the operating system. There are also a large number of third party cluster services applications that integrate with the major operating systems. Typically cluster services will require a shared storage resource and a network port on each cluster node for sending and receiving heartbeat information.
1) The shared everything model allows all nodes in the cluster to have shared access to the cluster resources. In order to achieve this, the cluster needs to use a distributed lock manager to control node access. Some users have questioned the scalability of the shared everything model because of the overhead and complexity of managing the resource locking.
2) The shared nothing model also uses shared storage but only one node in the cluster has access to a resource at any given point in time. The shared nothing model is often referred to as Active/Passive or Active/Active. An Active/Passive cluster is one in which one node manages all the resources and the second node acts as a fail-over node that takes ownership of the active resources in the event of a failure. An Active/Active cluster is one in which both nodes are actively hosting independent resources. In the event of a failure the surviving node takes ownership of all resources.
Both models of fail-over clustering are widely used for applications such as web, file and print, messaging and database servers.
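The load balancing arrangement in item a), where every server receives each request and a rule decides which one processes it, can be sketched as follows. The rule used here (a checksum of the client address modulo the number of servers) is an assumption chosen for illustration; real load balancing services apply their own rules. The function names and the sample address are hypothetical.

```python
import zlib

SERVER_COUNT = 50  # the fifty-server web farm from the example


def owner_index(client_ip: str, server_count: int = SERVER_COUNT) -> int:
    """Map a client address to exactly one server index, deterministically."""
    return zlib.crc32(client_ip.encode("ascii")) % server_count


def should_process(server_index: int, client_ip: str,
                   server_count: int = SERVER_COUNT) -> bool:
    """Each server runs this check on every request it intercepts."""
    return owner_index(client_ip, server_count) == server_index


client = "192.0.2.17"  # documentation-range address used as sample input
handlers = [i for i in range(SERVER_COUNT) if should_process(i, client)]
print(f"Servers that process the request from {client}: {handlers}")
```

Because every server evaluates the same deterministic rule, exactly one of them accepts each request without any coordination traffic between servers.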
Some of the higher end clustering services software packages also include data replication features. These features can be used to build what are known as stretch clusters. A stretch cluster basically allows you to increase the distance between the cluster nodes. NOTE An example of a stretch cluster is where one node might be located in London whilst the second node is located in Manchester. The replication engine ensures that the data copies at both locations are consistent. In the event of a failure the replication engine simply brings the secondary copy of data online so the surviving node can take ownership of all the cluster resources. This process can take a couple of minutes but importantly there is no outage at the application level and the fail-over is transparent to the end users.
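The transfer of resource ownership to a surviving node can be sketched with a minimal heartbeat model. The class name, node and resource names, and the five-second timeout are assumptions for illustration; replication of the data itself and the storage layer are not modelled.

```python
class FailoverCluster:
    """Toy shared-nothing cluster: each resource is owned by one node at a time."""

    def __init__(self, nodes, ownership, heartbeat_timeout=5.0):
        self.nodes = list(nodes)
        self.ownership = dict(ownership)        # resource name -> owning node
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = {node: 0.0 for node in self.nodes}

    def heartbeat(self, node, now):
        """Record that a heartbeat was received from a node at time 'now'."""
        self.last_heartbeat[node] = now

    def check(self, now):
        """Move resources away from nodes whose heartbeat has timed out."""
        alive = [n for n in self.nodes
                 if now - self.last_heartbeat[n] <= self.heartbeat_timeout]
        if not alive:
            return  # no surviving node to take ownership
        for resource, owner in list(self.ownership.items()):
            if owner not in alive:
                self.ownership[resource] = alive[0]


# Active/Active: both nodes host independent resources in normal operation.
cluster = FailoverCluster(
    nodes=["london", "manchester"],
    ownership={"database": "london", "messaging": "manchester"},
)
cluster.heartbeat("manchester", now=10.0)   # london has stopped sending heartbeats
cluster.check(now=12.0)
print(cluster.ownership)
```

An Active/Passive cluster is the same model with all resources initially assigned to one node; the passive node owns nothing until the check moves resources to it.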
at multiple times during the day. If keeping back-ups on site, they should be stored in a controlled environment or a fireproof safe ideally at some distance from the original source. It is a common mistake to make critical back-ups but leave them sitting in the office reception or on the loading dock for hours waiting for an offsite courier to take them to a secure store. If the back-up copies are held at a back-up site it can be beneficial to load up files from tape to disk. This verifies the back-up (the tapes can be read) and can also reduce the recovery time. A more sophisticated variation on the standard tape back-up mechanism is to perform remote back-ups. Usually only seen on high-end systems with high speed fibre links, the data is backed up to tape at the remote back-up site providing both a back-up and secure offsite back-up at the same time.
be in step with other applications being recovered by the organization. Thus if data needs to be synchronized between applications, additional recovery steps may be required to resynchronize the applications.
load on the host system. Handling replication at the control unit level also allows the control unit to create multiple copies or snapshots of the volumes. These snapshots can be used to drive separate applications such as overnight batch processing and their careful use can vastly improve overall system throughput. However it should be noted that depending on how it is used, storage array based replication can introduce I/O latency and potentially delay the I/O completion.
Bibliography
Standards publications
PAS 56:2003, Guide to Business Continuity Management
BS ISO/IEC 20000, Information Technology Service Management
ISO/IEC 17799:2005, Code of Practice for Information Security Management
ISO Guide 73:2002, Risk management. Vocabulary. Guidelines for use in standards
Other publications
[1] IT Infrastructure Library (ITIL). Office of Government Commerce: The Stationery Office.
[2] PRINCE2 Maturity Model (P2MM). Office of Government Commerce (OGC).
[3] Project Management Body of Knowledge. Project Management Institute (PMI).
Further Reading
TR 19:2005, Technical Reference for Business Continuity Management (BCM). SPRING Singapore.
Emergency Preparedness: Guidance on Part 1 of the Civil Contingencies Act 2004, its associated Regulations and non-statutory arrangements. Home Office: The Stationery Office.
Generally Accepted Practices for Business Continuity Practitioners. Disaster Recovery Journal and DRI International, 2005.
Business Continuity. CBI with Computacenter, 2002.
A Risk Management Standard. The Institute of Risk Management, The Association of Insurance and Risk Managers and The National Forum for Risk Management in the Public Sector, 2002.
Microsoft Operations Framework, a pocket guide. Van Haren Publishing, ISBN 9077212108.
Management of Risk: Guidance for Practitioners. Office of Government Commerce: The Stationery Office.
A Guide to Business Continuity Planning. James C. Barnes, ISBN 0-471-53015-8.
Information on standards
BSI provides a wide range of information on national, European and international standards through its Library and its Technical Help to Exporters Service. Various BSI electronic information services are also available which give details on all its products and services.
Copyright
Copyright subsists in all BSI publications. BSI also holds the copyright, in the UK, of the publications of the international standardization bodies. Except as permitted under the Copyright, Designs and Patents Act 1988 no extract may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, photocopying, recording or otherwise without prior written permission from BSI. This does not preclude the free use, in the course of implementing the standard, of necessary details such as symbols, and size, type or grade designations. If these details are to be used for any other purpose than implementation then the prior written permission of BSI must be obtained. Details and advice can be obtained from the Copyright & Licensing Manager. Tel: +44 (0) 20 8996 7070 Fax: +44 (0) 20 8996 7553 Email: copyright@bsi-global.com BSI, 389 Chiswick High Road London W4 4AL.
Buying standards
Orders for all BSI, international and foreign standards publications should be addressed to Customer Services. Tel: +44 (0)20 8996 9001 Fax: +44 (0)20 8996 7001 Email: orders@bsi-global.com Standards are also available from the BSI website at http://www.bsi-global.com In response to orders for international standards, it is BSI policy to supply the BSI implementation of those that have been published as British Standards, unless otherwise requested.
British Standards Institution 389 Chiswick High Road London W4 4AL United Kingdom http://www.bsi-global.com ISBN 0 580 49047 5