Availability Benefits of Linux On Z System
David Raften
Raften@us.ibm.com
Although the Linux operating system and applications do not care what the underlying hardware is, the
hardware does matter from a cost and availability perspective. In the past, many data centers chose to run a
single application at a time on a server, often getting 10% utilization out of the server. Today, with
virtualization such as VMware, the utilization is higher, but not by much. When one considers the number
of servers configured for:
• Development
• Quality Assurance / Test
• Production
• Backup for production at the primary site
• Then double it for disaster recovery
The utilization is often no more than 35%. Although some users run the primary production server at a
higher utilization, the average across the data center is low. This is even more dramatic when one
considers that the configuration for a single application is then duplicated for each of the hundreds or
thousands of applications being run. Each of the tens of thousands of servers incurs expenses of:
• System Management. This is also a large part of the data center budget. How do you maintain all the
software to keep it current? How do you maintain the hardware?
• Facilities. Each server uses electricity, floor space, and cooling chiller systems. The availability of
electricity has driven many companies to spend millions of dollars to create new data centers
away from cities, in locations where power is more readily available.
• Hardware. If you buy a product, expecting to use all of it, are you happy if you can only use a
third? Why is this acceptable for servers?
If z System can improve the average utilization of all the servers to near 100%, one would need
significantly fewer servers to run the workload. The savings can be even greater by then using faster
servers. Many sites have seen a 20:1 reduction in the number of cores after moving applications to Linux on
z, with up to a 40% reduction in total data center expenses.
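The utilization argument can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing method: it assumes every server has the same speed, and the server count (1000) and the 95% target standing in for "near 100%" are hypothetical values chosen only for illustration.

```python
import math


def servers_needed(current_servers: int, current_util: float, target_util: float) -> int:
    """Servers of equal speed needed to carry the same work at a higher utilization."""
    if not (0 < current_util <= 1 and 0 < target_util <= 1):
        raise ValueError("utilization must be in (0, 1]")
    busy_capacity = current_servers * current_util  # work actually being done
    return math.ceil(busy_capacity / target_util)


# Hypothetical example: 1000 servers averaging 35% utilization.
print(servers_needed(1000, 0.35, 0.95))
```

Faster cores would reduce the count further, which this sketch deliberately leaves out.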
• Missed business opportunity. Look at the number of transactions not run during the period of
the outage and the average revenue generated by each transaction. This figure needs to be reduced
by the transactions that can be deferred until the system comes up again.
• Loss of productivity. What is the hourly cost of all the affected employees who can no longer do
their jobs? What is the hourly cost of the data center?
• Loss of brand image and customers. If the system is often unavailable or even just performing
badly, how many customers will permanently move to your competition?
• Other factors. This includes financial penalties, overtime payments, and wasted goods.
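The cost factors listed above can be combined into a rough estimate. This is a sketch under stated assumptions: the class name, field names, and all figures are hypothetical, the model is linear in outage duration, and loss of brand image is left out because it is subjective.

```python
from dataclasses import dataclass


@dataclass
class OutageInputs:
    hours: float
    transactions_per_hour: float
    revenue_per_transaction: float
    deferrable_fraction: float       # share of transactions that simply run later
    affected_employees: int
    hourly_cost_per_employee: float
    other_costs: float = 0.0         # penalties, overtime payments, wasted goods


def outage_cost(o: OutageInputs) -> float:
    """Missed business plus lost productivity plus other costs."""
    missed = (o.hours * o.transactions_per_hour * o.revenue_per_transaction
              * (1 - o.deferrable_fraction))
    productivity = o.hours * o.affected_employees * o.hourly_cost_per_employee
    return missed + productivity + o.other_costs


# Hypothetical two-hour outage.
example = OutageInputs(2, 1000, 5.0, 0.5, 10, 50.0, 100.0)
print(outage_cost(example))
```

A real estimate would add the data center's own hourly cost and would likely weight longer outages more heavily than this linear sketch does.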
The cost per minute of an outage increases with the duration of the outage. The effects on
customer service are subjective and depend on how frequently outages occur and for how long. The
more customers are affected by the outage, the greater the chance of them taking their business
elsewhere.
Different hardware platforms have different availability characteristics. This affects the bottom line on
the Total Cost of Ownership for the application solution.
As soon as the first Linux transaction runs on z Systems, it gets all the availability protection that the z
System is known for without any application change.
IBM has a requirement: each System z server needs to be better than its predecessor. To meet it,
z Systems addresses availability concerns at different levels in each of its major subsystems.
Some major functions include the following:
o Capacity backup systems - systems used for emergency backup; keep them "running"
but reduce energy consumption. Systems can quickly be brought back to full
performance.
A more detailed, although not inclusive, list of availability features of the z Systems can be found in
Appendix A – Selected z System availability features.
The z System hardware provides other features to improve availability for proactive error detection and
notification. These include:
• Call Home
• IBM zAware
The server also looks for degraded conditions, in which the server is still operating but some hardware is
not working:
• Loss of channels due to CPC hardware failure
• Loss of memory
• The drawer is no longer functioning
• Capacity BackUp (CBU) resources have expired
• Processor cycle time reduced due to temperature problem
• CPC was IMLed during cycle time reduction.
• Repeated intermittent problems
When it detects a problem, the call home service automatically gathers the basic information to resolve
the problem and sends an email with log files or other diagnostics for the failure condition. IBM Support
processes the email information, opens a Problem Management Record (PMR) and assigns it to a
The ability to diagnose and report on troublesome but still working components has at times allowed
IBM customer engineers (CEs) to arrive with a replacement and dynamically change the part before any
failure has occurred.
IBM z/VM is the premier mainframe virtualization platform, supporting thousands of virtual servers in a
single footprint, more than any other platform. The z/VM hypervisor is designed and developed in
conjunction with the z System hardware. As such, it can exploit new hardware functions for
performance, security, and availability and pass these benefits on to its guests such as Linux, as well as
z/VSE, z/OS, and z/TPF.
There are many examples of z/VM exploiting hardware functions. Some examples include High
Performance FICON (zHPF) for more I/O throughput, HyperPAV for less I/O contention, Simultaneous
Multi-Threading (SMT) for performance, hardware based cryptographic acceleration, zEDC Express for
high-performance, low-latency hardware data compression while reducing disk space and improving
channel and networking bandwidth, and of course, all of the availability features described in the
previous chapter.
As well as the performance and security capabilities, from an availability perspective the power of z/VM
is its ability to efficiently virtualize hardware components as well as its implementation of Live Guest
Relocation.
The ability to virtualize hardware components adds another layer of availability. Efficient virtualization
of processor, memory, communications, I/O, and networking resources helps reduce the need to
duplicate and manage hardware, programming and data resources. z/VM can significantly over-commit
these real resources and allow users to create a set of virtual machines with assets that exceed the
amount of real hardware available. This reduces hardware requirements and simplifies system
management. Because resources are virtualized, a z/VM guest sees only what z/VM presents to it. If
there is a problem with a hardware component, z/VM can seamlessly switch to use the redundant
component and hide this from the guest.
One example is the ability to balance the workload across multiple cryptographic devices, and should
one device fail or be brought offline, z/VM can transparently shift Linux systems using that device to an
alternate cryptographic device without user intervention.
Another example of this is Multi-VSwitch Link Aggregation support. It allows a port group of OSA-
Express network adapter features to span multiple virtual switches within a single z/VM system or
between multiple z/VM systems. A single VSwitch can provide a link aggregation group across multiple
network adapters, and make that highly-available connection available to guests transparently. Sharing
a Link Aggregation Port Group with multiple virtual switches increases optimization and utilization of the
OSA-Express adapters when handling larger traffic loads and enables sharing the network traffic among
multiple adapters while still presenting only a single network interface to the guest.
HiperSockets can be used for communication between Linux, as well as z/OS, z/VM, and z/VSE instances
on the same server. It provides an internal virtual IP network using memory to memory communication.
z/VM exploits the hardware functions through direct execution of the machine instructions. Since it
knows what hardware it is running on there is no need for an additional layer to trap hardware-directed
instructions such as for disk or network access, and then emulate them. This allows z/VM and its guests
to run significantly faster than other solutions that need to interrupt execution, emulate instructions,
and then follow pointers to the emulated code. The system runs at native speed.
The virtualization capabilities in a single System z footprint can help to support thousands of virtual
Linux servers. Since a single IBM System z server doesn’t require external networking to communicate
between the virtual Linux servers, all of the Linux servers are in a single box, communicating via very fast
internal I/O connections.
The ability of z/VM to provide simple virtualization at high performance helps provide availability as
seen by the end user.
The most prevalent outage type in a z Systems environment is for software or hardware maintenance or
upgrades. The IBM z/VM Single System Image Feature provides live guest relocation, a process where a
running virtual machine can be relocated from one z/VM member system of a cluster to another. Virtual
servers can be moved to another LPAR on the same or a different z Systems without disruption to the
business. Relocating virtual servers can be useful for load balancing and for moving workload off of a
physical server or member system that requires maintenance. After maintenance is applied to a
member, guests can be relocated back to that member, thereby allowing z/VM maintenance while
keeping the Linux on System z virtual servers available.
Checks are in place before a Linux guest is relocated to help avoid application disruption. Some checks
include:
• It has enough resources available on the target system, such as memory, CPU, and so on.
VMware supports Live Guest Relocation with the vMotion technology, but it has a different design
point. It was designed not to provide for planned outages, but rather to try to help avoid unplanned
Another differentiation is the flexibility of where a guest can be relocated. x86 servers often do not
support full backward compatibility. In this environment one must plan for the target for each guest
and upgrade the servers as a group or else have the administrator fence off specific instruction sets if a
guest is moved to an older server model. By design, z System supports full backward compatibility.
Applications written in 1965 can still run on today’s servers. While some hardware features such as
hardware encryption may not be consistent across the servers, the Linux guests will still run
uninterrupted.
Massive virtualization massively reduces the amount of hardware in the infrastructure. There are
many times fewer servers, and since Linux-to-Linux communication can take place over virtual links,
far fewer cables and ports. From an availability point of view, the end-to-end availability as
measured by the user requires all of the components to be available. For example, if four
components are touched, such as servers, routers, ports, etc., and each is 99% available, then the
net effective availability is .99 x .99 x .99 x .99 = .9606, or only about 96%.
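The calculation above generalizes to any chain of components that must all be available. A minimal sketch, assuming the components fail independently; the function names are illustrative.

```python
import math


def series_availability(availabilities):
    """End-to-end availability when every component in the path must be up."""
    return math.prod(availabilities)


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours in a 365-day year."""
    return (1 - availability) * 365 * 24


end_to_end = series_availability([0.99] * 4)
print(f"{end_to_end:.4f}")  # 0.9606
print(downtime_hours_per_year(end_to_end))
```

Removing a component from the path removes a factor smaller than 1 from the product, which is why fewer servers, cables, and ports raise the availability the user sees.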
System management is that much easier. Cloning servers is much easier than installing servers.
Provisioning new servers can be done in just a couple of minutes, compared to days for real servers.
All this affects availability.
There is also less hardware that can fail. The distributed model for providing High Availability is to
deploy redundant physical servers. Often, this means more than just two, but rather several physical
servers clustered together so that if any one of them fails there will be enough spare capacity spread
around the surviving servers in the cluster to absorb the failed server’s work. However, something often not
considered is that as the number of physical servers increases, so does the number of potential points of
failure – you have eliminated single points of failure, but by increasing physical components you have
increased the odds that something will fail. By contrast, you can put z/VM LPARs on the same server and
eliminate all single points of failure with only two z/VM instances – except for the z System server itself
which, as explained above, is highly available. Furthermore, since we can share CPU capacity between
those two LPARs, if one entire z/VM should fail the surviving z/VM will instantly and transparently
inherit the failed z/VM’s CPU capacity (although not its memory). It is like squeezing a balloon – one
side gets smaller and the other gets bigger.
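The point that more physical components raise the odds of some failure can be illustrated numerically. This sketch assumes independent failures and one shared per-component failure probability over some fixed period; the 1% figure and the component counts are hypothetical.

```python
def prob_any_failure(n_components: int, p_fail: float) -> float:
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_fail) ** n_components


# Hypothetical: each component has a 1% chance of failing in the period.
for n in (2, 10, 100):
    print(n, round(prob_any_failure(n, 0.01), 3))
```

Redundancy means a single failure need not interrupt service, but each failure still has to be detected, absorbed by the surviving servers, and repaired.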
From a disaster recovery point of view, recovery planning and actions are that much easier. There are
fewer servers, hypervisors, and multi-vendor provisioning tools to worry about. In the event of a
total site failure, bringing production images and
workloads up at the recovery site can now consistently be done within a single shift. Best of all, if a
As an additional benefit, having fewer servers greatly reduces software fees, system management
expenses, total hardware expenses, energy usage, and floor space.
The Linux Health Checker tool can identify potential problems before they impact your system’s
availability or cause outages. It collects and compares the active Linux settings and system status for a
system with the values provided by health-check authors or defined by the user. It produces output in
the form of detailed messages, which provide information about potential problems and the suggested
actions to take.
Although the Linux Health Checker will run on any Linux platform which meets the software
requirements, currently available health check plug-ins focus on Linux on z Systems. Examples of health
checks include:
• Configuration errors
• Deviations from best-practice setups
• Hardware running in degraded mode
• Unused accelerator hardware
• Single points of failure
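The compare-and-report idea behind such a health check can be sketched in a few lines. This is not the Linux Health Checker's implementation or plug-in interface; the function, the setting names, and the messages are all hypothetical and serve only to show how collected values, expected values, and suggested actions fit together.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Finding:
    setting: str
    actual: Any
    expected: Any
    action: str


def run_checks(active: Dict[str, Any],
               expected: Dict[str, Any],
               actions: Dict[str, str]) -> List[Finding]:
    """Compare collected settings with expected values and report deviations."""
    findings = []
    for name, want in expected.items():
        have = active.get(name)  # None when the setting was not collected
        if have != want:
            findings.append(
                Finding(name, have, want, actions.get(name, "Review this setting."))
            )
    return findings


# Hypothetical settings, for illustration only.
active = {"paths_to_disk": 1, "crypto_adapter_in_use": True}
expected = {"paths_to_disk": 2, "crypto_adapter_in_use": True}
actions = {"paths_to_disk": "Add a second path to remove the single point of failure."}

for f in run_checks(active, expected, actions):
    print(f"{f.setting}: found {f.actual}, expected {f.expected}. {f.action}")
```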
5 DISASTER RECOVERY
When most people think of when they would need to implement a disaster recovery plan, they think of
major “front page” events: natural or man-made disasters such as flooding, earthquakes,
or plane crashes. In reality, a site is much more likely to sustain a failure or temporary
outage due to other, smaller factors. Real examples have included air conditioner failure, train
derailment, a snake shorting out the power supply, a coffee machine leaking water, or smoke from a
nearby restaurant. Often the management decision is to not declare a disaster because it would take
too long to restore service at the recovery site, there will be data loss, and there is no easy plan to bring
service back to the primary site. This is usually not due to the issues with the z System servers, but
rather with the distributed environment. The decision is to “gut it out,” and wait until service can be
restored. While this is happening, money is being lost for the company.
There are two common options for how the recovery site is managed: it can be “in-house,” owned by
the company, or it can be managed by a business resiliency service provider. These have two very
different implications for recovering the x86 servers, with far fewer differences for z System servers.
One difference between z System and distributed environments is the variability. There are many
different distributed operating systems such as Windows, Unix (AIX, Sun, HP-UX, … ), or Linux (RHEL,
SUSE, … ). Each of these comes with different release and version levels. On top of that are the different
hypervisors such as VMware, KVM, or Hyper-V. As a result, many of these operating systems have
dependencies tied to a specific hardware abstraction layer which is tied to physical or virtual systems. It
is impossible for a recovery service provider to duplicate the exact same hardware configuration for all
its customers, so in a disaster recovery situation, especially if there is a regional event, the hardware
configuration at the recovery site will be different from what is being run in the production site. In fact,
How do you recover on dissimilar hardware? With a finite amount of assets at the recovery site, a
recovery service provider cannot mirror the specific hardware configuration for every client, including
the server type, storage type, firewalls, load balancers, routers, gateways, etc. Kernel drivers may be
tied to specific hardware, so before restored systems can be started, it may be necessary to first modify
or update the operating system level and device drivers to match the target recovery
hardware. Since the service provider may guarantee an “equal or greater” hardware platform,
it is not known ahead of time what will be used. If there are issues, then multiple skill sets are needed
to do problem determination. Consequently, you may run into performance issues once the recovered
systems come up due to applications being tied to specific hardware devices, or some systems may just
not be recoverable on the new hardware. This issue can be eliminated by running Linux on z. Even though
there can be different levels of z System hardware (z196, zEC12, z13... ) at different driver levels, all
of the z System hardware is backwards compatible. In addition, the extreme virtualization provided by
z/VM reduces the amount of hardware variability.
Dissimilar disk is also an issue, for the same reasons as dissimilar servers. Although SCSI attached disk is
supported with Linux on z with support for FB format data, many choose to place the Linux data on ECKD
formatted disk for advantages in system management, reliability, and lower CPU consumption. This disk is
storage agnostic: ECKD disk from any storage vendor appears as the same generic (3390) disk due to the
standardized interface. This, plus the fact that there is no internal disk on z System, reduces the
complexity of managing different disk devices and driver levels.
Distributed systems need to restore the production images prior to restoring databases. These
production images often have many different drive volumes (C-drive, D-drive, etc.) sometimes with a
dozen or more drives for each system. This can easily amount to hundreds of drives that need to be
restored using tools such as the Tivoli Storage Manager, Symantec NetBackup servers, Fibre Channel
libraries, etc. If restoring from tape, one quickly runs into a tape drive bottleneck. If restoring from LAN,
then the network becomes a bottleneck. This process can typically take 6 or more hours to just bring up
the backup restore servers before even database restores can be started. This process is not needed
with System z. Just connect to the RESLIB volume containing the z/VM libraries, IPL the LPARs, and you
have immediate access to the applications and data.
Database restoration on distributed systems can also be an issue. If using tape, the manner in which
tapes store data, the data volume, and the file data size are all factors in restores. If it takes the mounting
of multiple tapes to access a single server’s data to restore, then other systems are waiting for access to
the tape drives. If the data is restored via the network, there needs to be enough network bandwidth on
the Backup/Restore server network adapters and LAN to restore the data. Crossing low bandwidth hops
could cause a restore bottleneck. z Systems can run 50 - 100 restore jobs at a time if restoring from
Fibre Channel library media. Restores occur via the SAN FICON environment and are NOT LAN based. In
addition, there are 8x8 configurable FICON paths to reach 64GB bandwidth per subsystem.
Production sites have hundreds or more servers. But for testing purposes the disaster recovery provider
may not have the same number of servers available. For example, one could have 250 servers in
Finally, z System supports extensive end-to-end automation such as with GDPS (see section “GDPS Virtual
Appliance”). This not only speeds up processes but more importantly removes people as a “Single Point
of Failure” during recovery. It is designed to automate all actions needed to restart production
workload in under 1 hour. This works with not only z/OS, but also Linux on z images.
Due to the time needed to fully restore the distributed environment, bring up the applications, and
resolve any data consistency issues between tape restores and disk remote copy restores, many sites
have gotten to the point of just bringing up the distributed server environments and data after three
days, then declared, “Success!” without actually running the applications or resolving the consistency
issues. This leaves a big hole in the D/R testing with the possibility of unknown problems coming up
should a real situation happen. z Systems are often fully restored and tested within a single shift.
Due to the issues described above, many of the larger corporations have chosen to invest in the
hardware and facility expenses for a dedicated in-house recovery site. This resolves many of the issues
described above, but at the expense of costs to keep another copy of the physical hardware such as
servers, disk, routers, etc. at the remote site, the floor space and energy usage of all the equipment, and
the system management costs of making sure the D/R site stays a mirror image of the production site.
Consequently, many clients find themselves slowly moving production into their recovery environments
over time to justify costs. Eventually the fine line between production and recovery environments will
become blurred.
Similarly, an additional complication is that sites often want to run Development and Test workloads on
the recovery servers. In the wake of a disaster the testing infrastructure disappears just when it is
needed the most since it becomes preempted for Production. One needs to plan for where this work will
now be run.
The most significant issue that is not resolved by in-house disaster recovery is the complexity of
recovery. The more heterogeneous servers there are, the more one needs to constantly fine-tune and
practice the D/R plan. Some considerations include:
• Documentation – Is the plan well documented, detailed, easy to follow, consolidated, and
current? In a real disaster, experienced staff may not be available.
• Complexity of applications – Which applications are the critical ones that need to be restarted
first? What about their dependencies? Is the e-mail system more important than customer
facing applications so you can communicate problems that may come up during recovery?
• Plethora of server types and levels – The more components, the more people on site are
required to manage the recovery, and the more things can go wrong. What about compatibility
of the different hardware and software levels? Does the configuration at the DR site reflect the
same configuration at the Production sites?
• How many backup tools are used – For each tool are multiple people trained to use them?
• Is there a plan to get back to the original environment – This is something often not taken into
consideration. How do you resynchronize the data back on the original site?
Despite having an in-house recovery site, due to the complexity of trying to manage and control the
recovery for hundreds or thousands of distributed servers, meeting the recovery time objective (RTO) is
at times not attainable.
Once the databases are restored, they may not be usable. Different applications have different
Recovery Time Objectives (RTO), how long the business can accept the application being unavailable,
and Recovery Point Objectives (RPO), how much data the business can afford to lose. The least
expensive option is to make a copy of the database every 24 hours and send the tapes off site. This
supports an RPO of 24 hours and RTO of typically three days. At the other end of the spectrum is disk to
disk remote copy, which supports an RPO of 0 (no data loss), with an RTO of 2 hours or less. A list of
disaster recovery options can be found at:
http://en.wikipedia.org/wiki/Seven_tiers_of_disaster_recovery . As one moves up the tiers the cost
increases. With that in mind, many use different D/R options depending upon the application.
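Choosing a D/R option per application amounts to picking the least expensive option whose RPO and RTO meet the application's objectives. The sketch below encodes only the two end points described above; the class and function names and the relative cost ranking are hypothetical, and a fuller table would list the intermediate tiers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    rpo_hours: float     # worst-case data loss
    rto_hours: float     # typical time to restore service
    relative_cost: int   # hypothetical ranking, 1 = least expensive


OPTIONS = [
    DrOption("daily tape backup sent off site", 24, 72, 1),
    DrOption("disk-to-disk remote copy", 0, 2, 2),
]


def cheapest_option(required_rpo_hours: float,
                    required_rto_hours: float,
                    options: List[DrOption] = OPTIONS) -> Optional[DrOption]:
    """Least expensive option that satisfies both objectives, or None."""
    candidates = [o for o in options
                  if o.rpo_hours <= required_rpo_hours
                  and o.rto_hours <= required_rto_hours]
    return min(candidates, key=lambda o: o.relative_cost, default=None)


print(cheapest_option(24, 72))  # tape is sufficient
print(cheapest_option(1, 4))    # needs remote copy
```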
Many recommend a mixed-tier approach, with the D/R solution used depending on the recovery
time objectives (RTO) and recovery point objectives (RPO) for the applications being run. Some servers
will then recover on pre-staged dedicated assets and others on hot-site syndicated hardware made
available within 24 hours of an event. In this case, systems and data may be recovered in order of
priority with a staggered RTO. Critical database application data, usually on z System, can be made
available within 4 hours or less, and applications / web services / data on distributed systems can be
made available in 24 hours or more. This causes complications. It is often the case that applications
share common files. Not only that, but often “Tier 1” applications rely on data generated by “Tier 3”
applications. How is data consistency maintained when some data is 30 seconds old, and other data is 24
hours old? Are the applications run and the data corruption accepted? How is the corruption resolved?
Can the required nightly batch jobs be run? Even if all the data is replicated by disk remote copy, due to
the different disk vendors being used, there is still the issue of a common consistency group between
the vendors.
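One way to surface the mixed-tier consistency problem during planning is to flag every application that depends on data protected with a looser RPO than its own. This is a planning sketch only; the application names, RPO values, and dependency map are hypothetical.

```python
from typing import Dict, List, Tuple


def stale_dependencies(rpo_hours: Dict[str, float],
                       depends_on: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (app, dependency) pairs where the dependency's data may be older."""
    risky = []
    for app, deps in depends_on.items():
        for dep in deps:
            if rpo_hours[dep] > rpo_hours[app]:
                risky.append((app, dep))
    return risky


# Hypothetical: a "Tier 1" app (30-second RPO) reads data from a "Tier 3" app (24-hour RPO).
rpo = {"orders": 30 / 3600, "pricing": 24.0}
deps = {"orders": ["pricing"], "pricing": []}
print(stale_dependencies(rpo, deps))
```

Each flagged pair is a place where recovered data may be inconsistent and where a common consistency group, or a plan for reconciling the data, is needed.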
The IBM SAN Volume Controller (SVC) can resolve the issue of providing a single consistency group
between disk vendors by using the same Metro Mirror or Global Mirror session for all the disk being
virtualized under it.
There are several tools that can be used to help manage and monitor the remote copy environment.
These include IBM Spectrum Control, Virtual Storage Center (VSC), Tivoli Storage Productivity Center for
Replication (TPC-R), and GDPS. Note that the GDPS Control LPAR requires z/OS with ECKD disk.
In a real site disaster, one does not have the luxury of several weeks’ notice to update D/R plans and get
the key personnel to the remote site ahead of time. In fact, the key personnel may be unavailable, may
not physically be able to get to the remote site or connect to it through the network, may have other
priorities such as the physical safety of family members or the home, or may not survive the event.
GDPS is an integrated end-to-end automated disaster recovery solution designed to remove people as a
Single Point of Failure.
There are several different flavors of GDPS, depending on the type of replication being used. They include
GDPS/PPRC HyperSwap Manager and GDPS/PPRC to manage and automate synchronous Metro Mirror
replication, GDPS/XRC and GDPS/Global Mirror for asynchronous replication, and GDPS/Active-Active
based on long-distance, software-based replication.
GDPS/PPRC enables HyperSwap, the ability to dynamically switch to secondary disk without requiring
applications to be quiesced. With a swap taking under seven seconds of user impact time for 10,000
device pairs, this provides near-continuous data availability for planned actions and unplanned events. It provides
disk remote copy management, and data consistency for remote disk up to 200 km away with qualified
DWDMs. GDPS/PPRC is designed to fully automate the recovery at the remote site. This includes disk
reconfiguration, managing servers, Sysplex resources, CBU, activation profiles, etc. GDPS/PPRC can be
used with any disk vendor that supports the Metro Mirror protocol. GDPS automation includes:
All this is done to support a Recovery Time Objective (RTO) less than an hour with a Recovery Point
Objective (RPO) of zero.
GDPS/PPRC is application and data independent. It can be used to provide a consistent recovery for z/OS
as well as non-z/OS data. This is especially important for when a multi-tier application has dependencies
GDPS Virtual Appliance is based on GDPS/PPRC and is designed for sites that do not have the z/OS skills to
manage the GDPS control system (“K-Sys”). The GDPS Virtual Appliance delivers the GDPS/PPRC
capabilities through a self-contained GDPS controlling system that is delivered as an appliance. A
graphical user interface is provided for monitoring of the environment and performing various actions
including maintaining the GDPS control system, making z/OS invisible to the system programmers. This
provides IBM z Systems customers who run z/VM and its associated guests, such as Linux on z Systems,
with high availability and disaster recovery benefits similar to those available for z/OS systems.
The automation capability of GDPS is unique and without peer in the distributed world.
6 SUMMARY
With the proliferation of smartphones and mobile computing, users have increasingly high
expectations for availability, and when service is unavailable, it is easy to share frustrations with
friends on social media. Users receiving this information can become dissatisfied with a brand, even if they
personally were not affected. This impacts customer retention and bottom-line profitability.
Given a choice of infrastructure on which to place customer-facing and mission-critical applications,
one would want to choose the platform that can provide the most benefit for the corporation. Much
has been written about the Total Cost of Ownership (TCO) benefits of Linux on z System including what
is found at www.ibm.com/systems/z/os/linux/resources/doc_wp.html , even without considering the
availability impacts to cost. When one adds the benefits of a highly available and secure hardware base,
extreme virtualization that is also designed to share hardware resources, additional RAS customization
APPENDIX A – SELECTED Z SYSTEM AVAILABILITY FEATURES
Although unplanned outage avoidance by using “n+1” components is what one normally thinks of when thinking
about availability, the z System goes well beyond that. A partial list of availability features includes:
• Power:
o N+1 Power subsystems
o N+1 Internal batteries
o Dual AC inputs
o Voltage transformation module (VTM) technology with triple redundancy on the VTM.
• Cooling
o Hybrid cooling system
o N+1 blowers
o Modular refrigeration units
• Cores:
o Dual instruction and execution with instruction retry
o Concurrently checkstop individual cores without outage
o Transparent CPU Sparing so if there is a problem with a core, then spares that come
with the server would detect this and take over. This would be invisible to the
applications and they would continue without any interruption.
o Point to point SMP fabric
• Memory:
o Redundant Array of Independent Memory (RAIM). Based on the RAID concept for disk,
memory can be set up to recover if there are any failures in a memory array. This
provides protection at the dynamic random access memory (DRAM), dual inline memory
module (DIMM), and memory channel levels.
o Extensive error detection and correction from DIMM level failures, including
components such as the controller application specific integrated circuit (ASIC), the
power regulators, the clocks, and the board
o Error detection and correction from Memory channel failures such as signal lines,
control lines, and drivers/receivers on the MCM
o ECC on memory, control circuitry, system memory data bus, and fabric controller
o Dynamic memory chip sparing
• Cache / Arrays
o Translation lookaside buffer retry / delete
o Redundant branch history table
o Concurrent L1 and L2 cache delete
o Concurrent L1 and L2 cache directory delete
o L1 and L2 cache relocate
o ECC for cache
• Input / Output
o FCP end to end checking
o Redundant I/O interconnect
o Multiple channel paths
o Redundant Ethernet service network w/ VLAN
o System Assist Processors (SAPs)
o Separate I/O CHPIDs
o Shared I/O capability
o Address limit checking
o Dynamic path reconnect
o Channel subsystem monitoring
• Security
o Integrated cryptographic accelerator
o Tamper-resistant Crypto Express feature
o Trusted Key Entry (TKE) 5.2 with optional Smart Card reader
o EAL Level 5 certified – the only platform that attained this level
• General
o Extensive testing of all parts, components, and system during the manufacturing phases
o Comprehensive field tracking
o Transparent Oscillator failover
o Automatic Support Element switchover
o Service processor reboot and sparing
o ECC on drawer interconnect
o Redundant drawer interconnect
o Frame Bolt Down Feature
o Storage Protection Keys
o FlashExpress (improved dump data capture)
• Power
o Concurrent internal battery maintenance
o Concurrent Power maintenance
• Cooling
o Concurrent Thermal maintenance
• Cores
o Concurrent processor book repair / add
o Transparent Oscillator maintenance
• Memory
o Concurrent memory repair / add
o Concurrent memory upgrade
o Concurrent memory bus adapter replacement
o Concurrent MBA hub upgrade
• Input / Output
o Concurrent repair on all parts in an I/O cage
o Upgrade on any I/O card type
o Concurrently checkstop individual channels
o Concurrent STI repair
o Concurrent I/O cage controller maintenance
o Dynamic I/O reconfiguration
o Hot-pluggable I/O
o Transparent SAP sparing
o Dynamic SAP reassignment
o Dynamic I/O Enablement
• Security
o Dynamically add Crypto Express processor
o Concurrent Crypto-PCI upgrade
• General
o Concurrent Microcode (Firmware) updates – Install and Activate driver levels and
MicroCode Load (MCL) levels based upon bundle number while applications are still
running.
o Concurrent major LIC upgrades (CPUs, LPAR, channels, OSA, Power and Thermal,
Service Processor, HMC, … )
o Dynamic Swapping of Processor Types
o On/Off Capacity Upgrades on Demand (OOCUoD)
o Capacity Backup (CBU)
o Concurrent service processor maintenance
APPENDIX B – REFERENCES