Chapter 1  Basic High Availability Concepts
Available
The term available describes a system that provides a specific level of service as needed. This idea of availability is part of
everyday thinking. In computing, availability is generally understood as the period of time when services are available (for
instance, 16 hours a day, six days a week) or as the time required
for the system to respond to users (for example, under one-second
response time). Any loss of service, whether planned or
unplanned, is known as an outage. Downtime is the duration of
an outage measured in units of time (e.g., minutes or hours).
Highly Available
The term highly available describes a system in which planned and unplanned outages are kept to a small, stated minimum. How small that minimum must be is defined by the service level the system is expected to meet, described next.
Service Levels
The service level of a system is the degree of service the
system will provide to its users. Often, the service level is spelled
out in a document known as the service level agreement (SLA).
The service levels your business requires determine the kinds of
applications you develop, and HA systems provide the hardware
and software framework in which these applications can work
effectively to provide the needed level of service. High availability implies a service level in which both planned and unplanned
computer outages do not exceed a small stated value.
Continuous Availability
Continuous availability means non-stop service, that is,
there are no planned or unplanned outages at all. This is a much
more ambitious goal than HA, because there can be no lapse in
service. In effect, continuous availability is an ideal state rather
than a characteristic of any real-world system.
This term is sometimes used to indicate a very high level of
availability in which only a very small known quantity of downtime is acceptable. Note that HA does not imply continuous availability.
Fault Tolerance
Fault tolerance is not a degree of availability so much as a
method for achieving very high levels of availability. A fault-tolerant system is characterized by redundancy in most hardware
components, including CPU, memory, I/O subsystems, and other
elements. A fault-tolerant system is one that has the ability to
continue service in spite of a hardware or software failure. However, even fault-tolerant systems are subject to outages from
human error. Note that HA does not imply fault tolerance.
Disaster Tolerance
Disaster tolerance is the ability of a computer installation to
withstand multiple outages, or the outage of all the systems at a
single site. For HP server installations, disaster tolerance is
achieved by locating systems on multiple sites and providing
architected solutions that allow one site to take over in the event of a disaster at another site.
5nines:5minutes
In 1998, HP management committed to a new vision for HA
in open systems: 99.999% availability, with no more than five
minutes of downtime per year. This ambitious goal has driven the
development of many specialized hardware and software facilities
by a number of vendors working in partnership. As of the year
2001, HP's own contributions include new generations of fault-resilient HP 9000 systems, improvements in the HP-UX operating system, new software solutions, and extensive monitoring
tools that make it possible to measure downtime with a high
degree of precision. Many of these improvements have been
added back into the standard HP hardware and software products
in a kind of trickle-down of technological improvement.
Not all users need every type of device or tool used to provide availability levels as high as 99.999%. Not all users wish to
pay the price that such tools command in the marketplace. But
everyone benefits from the effort to meet the goal of a very high
degree of availability as the technology advances. Consider the
analogy of race car engines: Even though you don't expect to see
a race car engine in a family sedan, the technology used in building and improving the race car engine eventually ends up improving the sedan anyway.
E-vailable Computing
The phenomenal expansion of Internet business activity has
created the need to define yet another type of availability: e-vailability, the availability of a server to support fast access to a Web
site. It is well known that at periods of peak demand, Web sites
suffer performance degradation to the point that users cancel an
attempted transaction in frustration at waiting too long or at being
refused access temporarily. E-vailability is a combination of the
traditional kinds of availability described earlier in this chapter
and sufficient server performance and capacity to meet peak
demands. Figure 1.3 shows the relationship of availability, performance, and capacity to achieve high levels of e-vailability.
By managing the components of e-vailability, you can allocate different levels of availability to different users depending on
their standing as customers. For example, the premier customer
class might be given the quickest access to a Web site, say under
one second, whereas ordinary customers might get access in one
to five seconds, and non-customers (simple Internet cruisers)
might obtain access in five to ten seconds. Thus within the
framework of e-vailability, HA can be a commodity that customers pay for, or it can be a reward for loyalty or high spending levels.
[Figure 1.3 Availability, performance, and capacity combine to provide e-vailability.]
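As a toy illustration of this kind of allocation, the sketch below maps customer classes to response-time windows matching the example tiers described above. The class names and numeric thresholds are hypothetical, chosen only to mirror the example; they are not from the original text.

    # Hypothetical mapping of customer class to acceptable response-time window,
    # following the example tiers described in the text.
    RESPONSE_TARGETS_SECONDS = {
        "premier": (0.0, 1.0),    # premier customers: under one second
        "ordinary": (1.0, 5.0),   # ordinary customers: one to five seconds
        "visitor": (5.0, 10.0),   # non-customers: five to ten seconds
    }

    def target_for(customer_class):
        """Return the (low, high) response-time window for a class of user."""
        return RESPONSE_TARGETS_SECONDS.get(customer_class, RESPONSE_TARGETS_SECONDS["visitor"])

    print(target_for("premier"))   # (0.0, 1.0)
    print(target_for("unknown"))   # unknown users fall back to the visitor tier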
Since any component can fail, the challenge is to design systems in which problems can be predicted and isolated before a
failure occurs and in which failures are quickly detected and corrected when they happen.
Choosing a Solution
Your exact requirement for availability determines the kind
of solution you need. For example, if the loss of a system for a
few hours of planned downtime is acceptable to you, then you
may not need to purchase storage products with hot pluggable
disks. On the other hand, if you cannot afford a planned period of
maintenance during which a disk replacement can be performed
on a mirrored disk system, then you may wish to consider an HA
disk array that supports hot plugging or hot swapping of components. (Descriptions of these HA products appear in later sections.)
Keep in mind that some HA solutions are becoming more
affordable. The trickle-down effect has resulted in improvements
in HP's Intel-based NetServer systems, and there has been considerable growth in the number of clustering solutions available
for PCs that use the Windows and Linux operating systems.
For some applications, the need for HA may only be for that portion of the day when a particular stock market is active; at other times, systems may be safely brought down.
When quoting or comparing availability figures, it is important to establish clear definitions of what the numbers mean and then use them consistently. Remember that availability is not a measurable
attribute of a system like CPU clock speed. Availability can only
be measured historically, based on the behavior of the actual system. Moreover, in measuring availability, it is important to ask not
simply, "Is the application available?" but, "Is the entire system providing service at the proper level?"
Availability is related to reliability, but they are not the same
thing. Availability is the percentage of total system time the computer system is accessible for normal usage. Reliability is the
amount of time before a system is expected to fail. Availability
includes reliability.
Calculating Availability
The formula in Figure 1.4 defines availability as the portion
of time that a unit can be used. Elapsed time is continuous time
(operating time + downtime).
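Written out from these definitions:

    Availability = Operating Time / Elapsed Time
                 = (Elapsed Time - Downtime) / Elapsed Time

For example, a 24x7x52 system that is down for 88 hours out of 8760 elapsed hours in a year has an availability of (8760 - 88) / 8760, or about 99%, which matches the first row of the table below.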
Table 1.1  Uptime and Downtime for a 24x7x52 System

Availability   Minimum Expected   Maximum Allowable   Remaining
               Hours of Uptime    Hours of Downtime   Hours
99%            8672               88                  0
99.5%          8716               44                  0
99.95%         8755               5                   0
100%           8760               0                   0
The table shows that there is no remaining time on the system at all. All the available time in the year (8760 hours) is
accounted for. This means that all maintenance must be carried
out either when the system is up or during the allowable downtime hours. In addition, the higher the percentage of availability,
the less time allowed for failure.
Table 1.2 shows a 12x5x52 system, which is expected to be
up for 12 hours a day, 5 days a week, 52 weeks a year. In such an
example, the normal operating window might be between 8 A.M. and 8 P.M., Monday through Friday.
Table 1.2  Uptime and Downtime for a 12x5x52 System

Availability   Minimum Expected   Maximum Allowable Hours of Downtime   Remaining
               Hours of Uptime    During Normal Operating Window        Hours
99%            3088               32                                    5642
99.5%          3104               16                                    5642
99.95%         3118               2                                     5642
100%           3118               0                                     5642
This table shows that for the 12x5x52 system, there are
5642 hours of remaining time, which can be used for planned
maintenance operations that require the system to be down. Even
in these environments, unplanned downtime must be carefully
managed.
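The figures in these tables follow from straightforward arithmetic. The short sketch below (Python is used here purely for illustration; it is not part of the original text) reproduces the calculation for any availability target and operating window. Because the published tables round to whole hours, one or two cells may differ slightly from this straight arithmetic.

    import math

    HOURS_PER_YEAR = 24 * 365  # 8760 hours of elapsed time per year

    def downtime_budget(availability, window_hours):
        """Uptime/downtime figures for one availability target.

        availability  -- a fraction, e.g. 0.995 for 99.5%
        window_hours  -- hours per year the system must be in service
                         (8760 for 24x7x52; 12 * 5 * 52 = 3120 for 12x5x52)
        """
        min_uptime = math.floor(availability * window_hours)          # minimum expected uptime
        max_downtime = math.ceil((1 - availability) * window_hours)   # maximum allowable downtime
        remaining = HOURS_PER_YEAR - window_hours                     # hours outside the operating window
        return min_uptime, max_downtime, remaining

    for target in (0.99, 0.995, 0.9995, 1.0):
        print("24x7x52 ", target, downtime_budget(target, 8760))
        print("12x5x52 ", target, downtime_budget(target, 12 * 5 * 52))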
Duration of Outages
An important aspect of an outage is its duration. Depending
on the application, the duration of an outage may be significant or
insignificant. A 10-second outage might not be critical, but two
hours could be fatal in some applications; other applications cannot even tolerate a 10-second outage. Thus, your characterization
of availability must encompass the acceptable duration of outages. As an example, many HP customers desire 99.95% availability on a 24x7 basis, which allows 5 hours of downtime per year. But they still need to determine what is an acceptable duration for a single outage. Within this framework, many customers state that they can tolerate single unplanned outages lasting no more than 15 minutes, with a maximum of 20 such outages per year (20 outages of 15 minutes each account for the full 5-hour annual budget). Other customers frequently wish to schedule planned downtime on a weekly, monthly, or quarterly basis. Note that allowing for planned downtime at a given level of availability reduces the number or duration of unplanned outages that are possible for the system.
[Figure 1.6 Unplanned downtime after a disk crash: clients are connected until the disk crashes at T1; service is lost while the operator detects the problem, decides on a fix, halts the system, and replaces the disk; clients reconnect at T4.]
Figure 1.7 shows the same crash when the system uses an
HA feature known as disk mirroring, which prevents the loss of
service.
[Figure 1.7 The same disk crash with mirrored disks: clients stay connected while the problem is detected and while data is rebuilt afterward; only a brief period of planned downtime is needed to replace the failed disk.]
[Figure 1.8 The same disk crash on an array with a hot-swappable spare disk: clients stay connected throughout problem detection, disk replacement, and data rebuild, with no downtime at all.]
With a disk array that supports hot plugging, the operator restores full protection by plugging a new disk mechanism in while the system is running. After the replacement disk is inserted, the array returns to the state it was in before the crash.
Typical causes of planned downtime include the following:
Periodic backups
Software upgrades
Hardware expansions or repairs
Changes in system configuration
Data changes
Causes of unplanned downtime include the following:
Hardware failure
"File System Full" error
"Kernel In-Memory Table Full" error
Disk full
Power spike
Power failure
LAN infrastructure problem
Software defect
Application failure
Firmware defect
Natural disaster (fire, flood, etc.)
Operator or administrator error
[Figure: Causes of unplanned downtime (Source: GartnerGroup, December 1998): IT processes 40%, operator errors 40%, hardware 20%.]
Developing a service level agreement is a useful exercise to carry out with the user community within your organization. The SLA can state the normal periods of operation for the system, list any planned downtime, and state specific performance requirements.
Examples of items that appear in SLAs include:
System will be 99.5% available on a 24x5x52 basis.
Response time will be 1-2 seconds for Internet-connected
clients except during incremental backups.
Full backups will take place once each weekend as
planned maintenance requiring 90 minutes.
Incremental on-line backups will be taken once a day during the work week with an increase in response time from
2 to 3 seconds for no more than 30 minutes during the incremental backup.
Recovery time following a failure will be no more than 15
minutes.
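As a rough check of the first item above: a 24x5x52 schedule covers 24 × 5 × 52 = 6240 hours per year, so 99.5% availability within that window permits at most about 0.005 × 6240, or roughly 31 hours, of downtime per year.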
The SLA is a kind of contract between the information technology group and the user community. Having an explicit goal
makes it easier to see what kind of hardware or software support
is needed to provide satisfactory service. It also makes it possible
to identify the cost/benefit tradeoff in the purchase of specialized
HA solutions.
Note that the system architecture will be different for a large
back-end database server than for a system of replicated Internet
servers that handle access to a Web site. In the latter case, the use
of multiple alternate server systems with sufficient capacity can
provide the necessary degree of e-vailability.
Routine backups
Routine maintenance tasks
Software upgrades
Recoveries following failures
Software Quality
Software quality is another critical factor in the overall
scope of HA and must be considered when planning a highly
available processing environment. The presence of a software
defect can be every bit as costly as a failed hardware component.
Thus, the operating system, middleware modules, and all application programs must be subjected to a rigorous testing methodology.
Intelligent Diagnostics
Sophisticated on-line diagnostics should be used to monitor
the operating characteristics of important components such as the
disks, controllers, and memory, and to detect when a component
is developing problems. The diagnostic can then proactively
notify the operator or the component vendor so that corrective
maintenance can be scheduled long before there is a risk of an outage.
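The sketch below shows the general shape of such a proactive monitor: sample a component's health figure periodically and notify the operator well before it reaches a failure threshold. The metric source, threshold, and notification mechanism here are placeholders for illustration, not a real HP diagnostic interface.

    import time

    WARN_THRESHOLD = 0.80  # notify when a component reaches 80% of its rated limit

    def notify_operator(message):
        """Placeholder: in practice this would send mail, page an operator, or open a ticket."""
        print("NOTIFY:", message)

    def monitor(components, read_metric, interval_seconds=300, cycles=1):
        """Poll each component's health figure and warn before it fails.

        read_metric(name) must return a 0.0-1.0 utilization or wear figure;
        where that figure comes from depends on the diagnostic tools in use.
        """
        for _ in range(cycles):
            for name in components:
                value = read_metric(name)
                if value >= WARN_THRESHOLD:
                    notify_operator("%s at %.0f%%: schedule corrective maintenance" % (name, value * 100))
            time.sleep(interval_seconds)

    # Example run with a canned metric source standing in for real diagnostics:
    monitor(["disk0", "controller1"],
            read_metric=lambda name: 0.85 if name == "disk0" else 0.20,
            interval_seconds=0)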
Eliminating human interaction allows you to create deterministic responses to error conditions: the same error condition
always results in the same system response. The use of networked
monitoring tools also lets you automate responses to errors.
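A deterministic response can be as simple as a fixed mapping from recognized error conditions to scripted actions, so that a given condition always produces the same response. The condition names and actions below are purely illustrative, not drawn from any particular product.

    # Hypothetical one-to-one mapping: a given error condition always triggers
    # the same scripted response; anything unrecognized is escalated to a person.
    RESPONSES = {
        "file_system_full": "remove archived log files and re-check free space",
        "lan_card_failure": "fail traffic over to the standby LAN interface",
        "disk_mirror_degraded": "notify operator and schedule disk replacement",
    }

    def respond(condition):
        """Return the single scripted action for a condition, or escalate it."""
        return RESPONSES.get(condition, "escalate to an operator for manual handling")

    print(respond("file_system_full"))
    print(respond("unknown_condition"))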
Any proposed HA design should be thoroughly tested
before being placed in production.
NOTE: If you are really concerned about HA, there is no room for
compromise. The uppermost goal must be meeting the HA requirements, and other considerations, such as cost and complexity, take
second place. It is important to understand these tradeoffs.
Summary
A highly available system must be designed carefully on
paper. It is important to do the following in the order specified:
1. Define a goal for availability, including a detailed listing
of your service level objectives for each application or
service.
2. Identify the maximum duration of an acceptable outage.
3. Measure the availability of the current system, if one is
in use. This includes understanding current statistics on
availability, including planned and unplanned downtime. Be sure to use measurements consistently, and
make sure everyone understands what the measurements mean.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.