
Designing A PowerHA SystemMirror For AIX High Availability Solution - HA17 - Herrera


IBM Power Systems Technical University

October 18–22, 2010 — Las Vegas, NV

Session Title:
Designing a PowerHA SystemMirror for AIX High Availability Solution

Session ID: HA17(AIX)

Speaker Name: Michael Herrera

© 2010 IBM Corporation


Best Practices for Designing a PowerHA
SystemMirror for AIX High Availability Solution
Michael Herrera (mherrera@us.ibm.com)
Advanced Technical Skills (ATS)
Certified IT Specialist

Agenda

• Common Misconceptions & Mistakes

• Infrastructure Considerations

• Differences in 7.1

• Virtualization & PowerHA SystemMirror

• Licensing Scenarios

• Cluster Management & Testing

• Summary

3
HACMP is now PowerHA SystemMirror for AIX!
HA & DR solutions from IBM for your mission-critical AIX applications

• Current Release: 7.1.0.X


– Available on: AIX 6.1 TL06 & 7.1

• Packaging Changes:
– Standard Edition - Local Availability
– Enterprise Edition - Local & Disaster Recovery

• Licensing Changes:
– Small, Medium, Large Server Class

Product Lifecycle:
Version Release Date End of Support Date
HACMP 5.4.1 Nov 6, 2007 Sept, 2011
PowerHA 5.5.0 Nov 14, 2008 N/A
PowerHA SystemMirror 6.1.0 Oct 20, 2009 N/A
PowerHA SystemMirror 7.1.0 Sept 10, 2010 N/A
* These dates are subject to change per Announcement Flash
4
PowerHA SystemMirror Minimum Requirements

PowerHA SystemMirror 7.1 (7.1.0.1 – Sep)
• AIX 7.1
• AIX 6.1 TL6 SP1

PowerHA SystemMirror 6.1 (6.1.0.2 – May 21)
• AIX 7.1
• AIX 6.1 TL2 with RSCT 2.5.4.0
• AIX 5.3 TL9 with RSCT 2.4.12.0

PowerHA Version 5.5 (5.5.0.6 – June 7)
• AIX 7.1
• AIX 6.1 TL2 SP1 with APAR IZ31208 and RSCT 2.5.2.0
  (Async GLVM: APARs IZ31205 and IZ31207)
• AIX 5L 5.3 TL9 with RSCT 2.4.10.0

HACMP 5.4.1 (5.4.1.8 – May 13)
• AIX 6.1 with RSCT 2.5.0.0 or higher
• AIX 5.3 TL4 with RSCT 2.4.5 (IY84920) or higher
• AIX 5.2 TL8 with RSCT 2.3.9 (IY84921) or higher
5
Common Misconceptions

• PowerHA SystemMirror is an out-of-the-box solution

– Application start / stop scripts must be written and tested by you (a minimal sketch follows at the end of this slide)
– Application monitors will also require scripting & testing

• PowerHA SystemMirror is installed, so we are completely protected

– Consider all single points of failure – e.g. SAN, LAN, I/O drawers, etc.

• Heartbeats go over a dedicated link

– All interfaces defined to the cluster will pass heartbeats (IP & non-IP)
– CAA changes this behavior significantly in 7.1

• With clustering I need two of everything – hence idle resources

Fact:
Clustering will highlight what you are & are NOT doing right in your environment
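
The application start and stop scripts are pieces you write and test yourself. A minimal sketch of the shape they usually take (the paths, user, and application commands below are hypothetical examples, not part of the product):

# cat /usr/local/hascripts/app_start.sh
#!/bin/ksh
# Start the application as its owning user and log the output for later review.
su - appuser -c "/opt/app/bin/start_app" >> /tmp/app_start.log 2>&1
exit 0

# cat /usr/local/hascripts/app_stop.sh
#!/bin/ksh
# Stop the application cleanly so the volume groups can be varied off during a fallover.
su - appuser -c "/opt/app/bin/stop_app" >> /tmp/app_stop.log 2>&1
exit 0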

6
Common Mistakes Beyond Base Cluster Functionality

• Down / missing serial networks
• EtherChannel links down
• Errors in verification
• Inconsistent AIX levels
• Down-level cluster filesets
• Missing filesets
• Missing custom disk methods
• SAN not built in a robust fashion
• Bootlist issues
• Dump devices: insufficient size, mirrored, no secondary device
• I/O pacing enabled (old values)
• HBA levels at GA code
• Fibre Channel tunable settings not enabled
• Interim fixes not loaded on all cluster nodes

Lack of education / experience:
• Not knowing expected fallover behaviors
• Lack of application monitoring
• Fallback policy not set to desired behavior
• Not knowing what to monitor or check (CLI, logs)

Poor change controls:
• Not propagating changes appropriately
• No change history

Solutions:
• IBM Training / Redbooks / Proofs of Concept / ATS Health-check Reviews
7
Identify & Eliminate Points of Failure

• LAN Infrastructure
– Redundant Switches

• SAN Infrastructure
– Redundant Fabric

• Application Availability
– Application Monitoring
– Availability Reports

8
Infrastructure Considerations

[Diagram: Site A and Site B, with Node A and Node B sharing replicated 50 GB LUNs (SITEAMETROVG). The LAN, SAN, and DWDM links between the sites all run through one pipe.]
Important:
Identify & Eliminate Single Points of Failure!
9
Infrastructure Considerations

[Diagram: the same two sites, now with redundancy designed in: an XD_ip network over the WAN, an XD_rs232 serial link, and net_ether_0 across the LANs; two disk heartbeat networks (diskhb_vg1 and diskhb_vg2), each a 1 GB ECM VG on a LUN shared by Node A and Node B; and the replicated 50 GB SITEAMETROVG LUNs across the SAN / DWDM links.]

Important:
Identify single points of failure & design the solution around them
10
Infrastructure Considerations

• Power redundancy
• I/O drawers
• SCSI backplane
• SAN HBAs
• Virtualized environments
• Application fallover protection

Real customer scenarios:
1. Two cluster nodes sharing the same I/O drawer
2. Application failure with no monitoring – the box remains up, so no cluster fallover occurs

Moral of the story:
* High availability goes beyond just installing the cluster software

11
PowerHA SystemMirror 7.1: Topology management
Heartbeating differences between earlier cluster releases

[Diagram: a four-LPAR cluster. In PowerHA SM 6.1 and earlier, RSCT heartbeat rings run over IP subnets plus point-to-point disk heartbeat networks (diskhb_net1–diskhb_net4). In PowerHA SM 7.1 with CAA, the same LPARs communicate via multicast.]

PowerHA SM 6.1 & earlier – RSCT-based heartbeating:
• Leader, Successor, Mayor, etc.
• Strict subnet rules
• No heartbeating over HBAs
• Multiple disk heartbeat networks: point-to-point only; each network requires a LUN with an ECM VG

PowerHA SM 7.1 with CAA – kernel-based cluster message handling:
• Multicast-based protocol
• Uses network & SAN as needed
• Discovers and uses as many adapters as possible
• All monitors are implemented at low levels of the AIX kernel and are largely insensitive to system load
• Single repository disk – used to heartbeat & store information

12
Transition of PowerHA Topology IP Networks

Network: net_ether_0

[Diagram: in 6.1 and below, each node has two interfaces (en0, en1) with base addresses on separate non-routable subnets (192.168.100.x and 192.168.101.x), a persistent IP (9.19.51.10 / 9.19.51.11), and service IPs (9.19.51.20 / 9.19.51.21); heartbeat rings run over the VLAN.]

Traditional heartbeating rules no longer apply. However, route striping is still a potential issue: when
two interfaces have routable IPs on the same subnet, AIX will send half the traffic out of each interface.

Methods to circumvent this:

• Link Aggregation / EtherChannel (a quick check follows below)
• Virtualized interfaces with dual VIO servers
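
A quick way to confirm how the aggregation is configured on each node (a sketch; ent0/ent1/ent2 are example adapter names, and EtherChannel devices are normally created through the smitty etherchannel menus):

# smitty etherchannel                                    <- create / change the EtherChannel device
# lsattr -El ent2 -a adapter_names -a backup_adapter -a mode
# entstat -d ent2                                        <- per-adapter statistics for the aggregation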

[Diagram: with link aggregation, each node presents a single interface (en2) backed by two physical adapters (ent0, ent1); the base address (9.19.51.10 / 9.19.51.11) and the service IPs (9.19.51.20 / 9.19.51.21) all sit on the routable subnet.]
13
PowerHA SM 7.1: Additional Heartbeating Differences
Heartbeating:
• Self Tuning Failure Detection Rate (FDR)
• All interfaces are used even if not in cluster networks

[Diagram: every interface on both nodes is monitored, including en3, which is not part of any cluster network; en2 carries the base and service addresses over the aggregated ent0/ent1 adapters.]

Serial Networks removed:


• No more rs232 support
• No more traditional disk heartbeating over ECM VG
• No more slow takeover w/disk heartbeat device as last device on selective takeover

Critical Volume Groups

• Replace Multi-Node Disk Heartbeating (MNDHB)
– e.g. the Oracle RAC three-disk volume group holding the voting files
– Unlike MNDHB, no longer for general use
– Migration is a manual operation, and a customer responsibility
– Any concurrent-access volume group can be marked as "Critical"

14
CAA – Cluster Aware AIX
Enabling tighter Integration with PowerHA SystemMirror

What is it:
• A set of services/tools embedded in AIX to help manage a cluster of AIX
nodes and/or help run cluster software on AIX
• IBM cluster products (including RSCT, PowerHA, and the VIOS) will use
and/or call CAA services/tools
• CAA services can assist in the management and monitoring of an arbitrary
set of nodes and/or running a third-party cluster

• CAA does not form a cluster by itself. It is a tool set.


• There is no notion of quorum (If 20 nodes of a 21 node cluster are down,
CAA still runs on the remaining node)
• CAA does not eject nodes from a cluster. CAA provides tools to fence a
node but never fences a node and will continue to run on a fenced node

Major Benefits:
• Enhanced Health Management (Integrated Health Monitoring)
• Cluster Wide Device Naming
15
Cluster Aware AIX Exploiters

[Diagram: RSCT consumers – IBM Director, DB2, PowerHA SystemMirror, TSA, HMC, HPC, VIOS, and storage products – sit on top of RSCT. Legacy RSCT bundles its own cluster messaging, monitoring, and configuration repository beneath Group Services and the resource manager services; with Cluster Aware AIX, those redesigned cluster layers are integrated into CAA, which exposes cluster repository, monitoring, messaging, and event capabilities through CAA APIs and UIs in AIX.]

• RSCT and Cluster Aware AIX together provide the foundation of strategic Power Systems SW

• RSCT-CAA integration enables compatibility with a diverse set of dependent IBM products

• RSCT integration with CAA extends simplified cluster management along with optimized and robust
cluster monitoring, failure detection, and recovery to RSCT exploiters on Power / AIX
16
Cluster Aware AIX: Central Repository Disk
Contrast from previous releases

Aids in:
• Global device naming
• Inter-node synchronization
• Centrally managed configuration
• Heartbeating device

Direction:
• In the first release, support is confined to shared storage
• Will eventually evolve into a general AIX device rename interface
• Future direction is to enable cluster-wide storage policy settings
• The PowerHA ODM will eventually also move entirely to the repository disk

PowerHA SystemMirror 6.1 & prior:
[Diagram: each host (Host 1–3) keeps its own copy of the HA ODM, kept consistent through cluster synchronization.]

PowerHA SystemMirror 7.1 & CAA:
[Diagram: Hosts 1–3 all read and write a single central repository disk.]

PowerHA SystemMirror will continue to run if the central repository disk goes away;
however, no changes may take place within the cluster.

17
Multi Channel Health Management – Out of the Box
Hardened Environments with new communication protocol

[Diagram: LPAR 1 and LPAR 2 exchange reliable heartbeats and messaging over multiple channels for faster detection and more efficient communication.]

• First line of defense: network
• Second line of defense: SAN
• Third line of defense: repository disk

Highlights:
• RSCT Topology services no longer used for cluster Heartbeating
• All customers now have multiple communication paths by default
18
Basic Cluster vs. Advanced Cluster Features

Basic cluster:
– Network topology
– Resource group(s): IPs, VGs, application server
– Application monitoring
– Pager events

[Diagram: a single resource group (IP, VGs, App Server) served over one IP network and one SAN network.]

Advanced cluster:
– Multiple networks: crossover connections, virtualized resources
– Multiple resource groups: mutual takeover, custom resource groups, adaptive fallover
– NFS cross-mounts
– File collections
– Dependencies: parent / child, location, start after, stop after
– Smart Assists
– Multiple sites: cross-site LVM configurations, storage replication, IP replication
– Application monitoring
– Pager events
– DLPAR integration: grow the LPAR on fallover
– Director management
– WebSMIT management
– Dynamic node priority

[Diagram: a two-site cluster (Site A and Site B) with multiple IP networks, three resource groups (App1, App2, Dev App), a SAN network, and disk replication between the sites.]
19
PowerHA SystemMirror: Fallover Possibilities
Cluster Scalable to 32 nodes

• One-to-one
• One-to-any
• Any-to-one
• Any-to-any

20


Methods to Circumvent Unused Resources

Mutual takeover:
– Resource Group A (Node A, Node B): shared IP, VG/s & filesystems, App 1
– Resource Group B (Node A, Node B): shared IP, VG/s & filesystems, App 2
– Resource Group C (Node B, Node A): shared IP, VG/s & filesystems, App 3
– Resource Group D (Node B, Node A): shared IP, VG/s & filesystems, App 4
(RG dependencies keep each pair together; each node hosts two groups and can take over the other node's groups)

Virtualization:
[Diagram: Node A on Frame 1 and Node B on Frame 2 are VIO clients; each frame's dual VIO servers provide virtual NICs and HBAs to the SAN, where the rootvg disks and the shared LUNs (e.g. oracle_vg1) reside on the storage subsystem.]

21
Power Virtualization & PowerHA SystemMirror

[Diagram: a PowerHA cluster of two client LPARs (PowerHA_node 1 and PowerHA_node 2) on separate frames; each node has a virtual Ethernet adapter (en0) and virtual fibre adapters (vfc0/vfc1) served by dual VIO servers, connected to the shared LAN, the SAN, and an external storage enclosure holding the rootvg and data volumes.]

• LPAR / DLPAR
• Micro-partitioning & shared processor pools
• Virtual I/O Server: virtual Ethernet, virtual SCSI, virtual fibre
• Live Partition Mobility
• Active Memory Sharing
• WPAR (AIX 6.1)

22
PowerHA SystemMirror Virtualization Considerations

• Ethernet Virtualization
– Topology should look the same as environment using link aggregation
– Version 7.1 no longer uses netmon.cf file
– As a best practice dual VIO Servers are recommended
• SEA Fallover Backend

• Storage Virtualization
– Both methods of virtualizing storage are supported
• VSCSI vs. Virtual fiber (NPIV)
– In DR implementations leveraging disk replication consider the
implications of using either option

• Benefits of virtualization:
– Maximize utilization of resources
– Fewer PCI slots & physical adapters
– Foundation for advanced functions like Live Partition Mobility
– Migrations to newer Power Hardware are simplified

* Live Partition Mobility & PowerHA SM complement each other: LPM covers planned (non-reactive) maintenance,
while PowerHA covers unplanned (reactive) outages. See Chapter 2.4, PowerVM Virtualization Considerations.
23
Virtual Ethernet & PowerHA SystemMirror
No Link Aggregation / Same Frame

[Diagram: SEA fallover within a single frame. Each VIO server bridges its physical adapter (ent0) to a virtual trunk adapter on PVID 10 through a Shared Ethernet Adapter (ent4/en6), with a control channel between the VIO servers on PVID 99; the PowerHA LPARs each see a single virtual interface (en0).]
This is a diagram of the configuration required for SEA fallover across VIO Servers. Note
that Ethernet traffic will not be load balanced across the VIO Servers. The lower trunk
priority on the “ent2” virtual adapter would designate the primary VIO Server to use.
24
Virtual Ethernet & PowerHA SystemMirror
Independent Frames & Link Aggregation

[Diagram: the same SEA fallover design on two independent frames. On each frame the VIO servers back the SEA with a link aggregation (ent3) over two physical adapters, plus a control channel, and the PowerHA LPAR on that frame (LPAR 1 on Frame 1, LPAR 2 on Frame 2) sees a single virtual interface (en0).]
25
PowerHA SystemMirror 6.1 & Below

[Diagram: with PowerHA 6.1 and below, each node presents one virtual interface (en0) carrying a base address (9.19.51.10 / 9.19.51.11) and a service IP (9.19.51.20 / 9.19.51.21) on net_ether_0; Topology Services heartbeats over the IP network and over a serial network (serial_net_0). Underneath, each frame uses dual VIO servers with SEA fallover and link aggregation.]

* The netmon.cf file is used for single-adapter networks
26


PowerHA SystemMirror 7.1 - Topology

All nodes are monitored:


Cluster Aware AIX tells you what nodes are
in the cluster and information on those
nodes including state. A special “gossip”
protocol is used over the multicast address
to determine node information and
implement scalable reliable multicast. No
traditional heartbeat mechanism is
employed. Gossip packets travel over all
interfaces including storage.

Differences:
• RSCT Topology services is no longer used for heartbeat monitoring
• Subnet Requirements no longer need to be followed
• Netmon.cf file is no longer required or used
• All interfaces are used for monitoring even if they are not in an HA network
(this may be tunable in a future release)
• IGMP snooping must be enabled on the switches

27
VSCSI Mapping vs. NPIV (virtual fiber)

[Diagram: two frames, each hosting a PowerHA node behind dual VIO servers. The rootvg and a vscsi_vg are presented through VSCSI (vhost/vscsi adapter pairs, with MPIO handled in the VIO servers), while an npiv_vg is presented through NPIV virtual fibre adapters (fcs0/fcs1) mapped straight to the client; all LUNs come from the shared storage subsystem.]
28
Live Partition Mobility Support with IBM PowerHA
How does it all work?

[Diagram: PowerHA Node 1 and Node 2 run on separate frames, fully virtualized through dual VIO servers, with rootvg and datavg on the SAN; Live Partition Mobility can move a node between the frames.]

Considerations:
• This is a planned move
• It assumes that all resources are virtualized through VIO (storage & Ethernet connections)
• PowerHA should only experience a minor disruption to the heartbeats during a move
• IVE / HEA virtual Ethernet is not supported for LPM
• VSCSI & NPIV virtual fibre mappings are supported

The two solutions complement each other by providing the ability to perform non-disruptive
maintenance while retaining the ability to fall over in the event of a system or application outage.

29
PowerHA and LPM Feature Comparison

Feature                                                PowerHA SystemMirror   Live Partition Mobility
Live OS/App move between physical frames*                       –                     Yes
Server workload management**                                    –                     Yes
Energy management**                                             –                     Yes
Hardware maintenance                                           Yes                    Yes
Software maintenance                                           Yes                     –
Automated failover upon system failure (OS or HW)              Yes                     –
Automated failover upon HW failure                             Yes                     –
Automated failover upon application failure                    Yes                     –
Automated failover upon VG access loss                         Yes                     –
Automated failover upon any specified AIX error                Yes                     –
  (via customized error notification of an error report entry)

* ~2 seconds of total interruption time

** Requires free system resources on the target system
30
PowerHA SystemMirror: DLPAR Value Proposition

Pros:
• Automated action on acquisition of resources (bound to the PowerHA application server)
• HMC verification checking for connectivity to the HMC
• Ability to grow the LPAR on failover
• Save $ on PowerHA SM licensing – thin standby node

Cons:
• Requires connectivity to the HMC
• Potentially slower failover
• Lacks the ability to grow the LPAR on the fly

System specs: 32-way (2.3 GHz) Squad-H+ with 256 GB of memory
Results:
– a 120 GB DLPAR add took 1 min 55 sec
– a 246 GB DLPAR add took 4 min 25 sec
– at 30% busy, running an artificial load, the add took 4 min 36 sec

[Diagram: LPAR A hosts the application server with a minimal CPU count plus DLPAR CPUs; LPAR B is the backup with a minimal CPU count; both communicate with the HMCs over ssh.]

31
DLPAR Licensing Scenario
How does it all work?

[Diagram: two Power7 740 16-way servers, each licensed for 10 CPUs of capacity.
System A hosts Oracle DB (1 CPU + 1 acquired via DLPAR with the app), Banner DB (1 CPU + 2 via DLPAR), the standby LPARs for PeopleSoft and Financial DB (1 CPU each), and a Print Server (2 CPU).
System B hosts the standby LPARs for Oracle DB and Banner DB (1 CPU each), PeopleSoft (1 CPU + 1 via DLPAR), Financial DB (1 CPU + 2 via DLPAR), and TSM (2 CPU).]

Applications                      CPU   Memory
Production Oracle DB               2    16 GB
Production PeopleSoft              2     8 GB
AIX Print Server                   2     4 GB
Banner Financial DB                3    32 GB
Production Financial DB            3    32 GB
Tivoli Storage Manager 5.5.2.0     2     8 GB
32
Environment: PowerHA App Server Definitions

The actual application requirements are stored in the PowerHA SystemMirror application server
definitions (e.g. Oracle DB: Min 1 / Desired 2 CPUs; Banner DB: Min 1 / Desired 3 CPUs) and are
enforced during the acquisition or release of application server resources.

During acquisition of resources at cluster start-up, the host will ssh to the pre-defined HMC(s)
to perform the DLPAR operation automatically (the sketch below shows the equivalent hand-run HMC command).

[Diagram: the same System A / System B layout as the previous slide; when each resource group starts,
the hosting node acquires its additional CPUs (+1 for Oracle DB and PeopleSoft, +2 for Banner DB and
Financial DB) through the HMC via DLPAR.]
33
Environment: DLPAR Resource Processing Flow

1. Activate LPARs – the LPAR profiles define Min / Desired / Max CPU values (e.g. Min 1 / Desired 1 / Max 2 for the Oracle DB LPARs and Min 1 / Desired 1 / Max 3 for the Banner DB LPARs).
2. Start PowerHA – the application server requirements (Min 1 / Desired 2 and Min 1 / Desired 3) are read, and the node hosting each resource group acquires the additional CPUs from the HMC via DLPAR (Oracle DB grows from 1 to 2 CPUs, Banner DB from 1 to 3).
3. Release resources – on a fallover or rg_move, the DLPAR CPUs are released on the source node and acquired on the target node.
4. Release resources – stopping the cluster without takeover releases the DLPAR-acquired CPUs.

Take-aways:
• CPU allocations follow the application server wherever it is hosted (this model allows you to lower the HA license count)
• DLPAR resources only get processed during the acquisition or release of cluster resources
• PowerHA 6.1+ provides micro-partitioning support and the ability to also alter virtual processor counts
• DLPAR resources can come from free CPUs in the shared processor pool or from CoD resources

34
PowerHA SystemMirror: DLPAR Value Proposition

Environment using the dedicated CPU model (no DLPAR):
[Diagram: Oracle DB 2 CPU / standby 2 CPU, Banner DB 3 CPU / standby 3 CPU, PeopleSoft 2 CPU / standby 2 CPU, Financial DB 3 CPU / standby 3 CPU, spread across System A and System B.]
PowerHA license counts: Cluster 1: 4 CPUs, Cluster 2: 6 CPUs, Cluster 3: 4 CPUs, Cluster 4: 6 CPUs – Total: 20 licenses

Environment using the DLPAR model:
[Diagram: each production LPAR runs with 1 CPU plus CPUs acquired via DLPAR with the application (+1 for Oracle DB and PeopleSoft, +2 for Banner DB and Financial DB); each standby LPAR has 1 CPU.]
PowerHA license counts: Cluster 1: 3 CPUs, Cluster 2: 4 CPUs, Cluster 3: 3 CPUs, Cluster 4: 4 CPUs – Total: 14 licenses
35
PowerHA SystemMirror: DLPAR Modified Model
Environment using the DLPAR model (same as the previous slide):
PowerHA license counts: Cluster 1: 3 CPUs, Cluster 2: 4 CPUs, Cluster 3: 3 CPUs, Cluster 4: 4 CPUs – Total: 14 licenses

Environment using the modified DLPAR model:
[Diagram: both production LPARs on each system are consolidated into one LPAR, with control separated by resource groups. System A hosts Oracle DB and Banner DB in one LPAR (1 CPU + 4 acquired via DLPAR with the apps) with a 1 CPU standby on System B; System B hosts PeopleSoft and Financial DB in one LPAR (1 CPU + 4 via DLPAR) with a 1 CPU standby on System A.]
PowerHA license counts: Cluster 1: 6 CPUs, Cluster 2: 6 CPUs – Total: 12 licenses

36
Data Protection with PowerHA SM 7.1 & CAA
Enhanced Concurrent Mode Volume Groups are now required

ECM VGs were introduced in version 5.1


• Fast Disk Takeover
• Fast Failure Detection
• Disk heartbeating

Disk Fencing in CAA


• Fencing is automatic and transparent
• Cannot be turned off
• Fence group created by cl_vg_fence_init called from node_up

CAA Storage Framework fencing support


• Ability to specify level of disk access allowed by device driver
– Read/Write
– Read Only
– No Access (I/O is held until timeout)
– Fast Failure

37
Data Protection with PowerHA SM 7.1 & CAA
ECM Volume groups and the newly added protection

• LVM enhanced concurrent mode VGs (passive mode):
– Prevent writes to logical volume or volume group devices
– Prevent filesystems from being mounted, or any change requiring access to them

• CAA fencing – prevents writes to the disk itself (e.g. dd, which runs below the LVM level)

[Diagram: Node A holds datavg (an ECM VG containing /data, /data/app, /data/db) in ACTIVE mode with read/write access; Node B holds it in PASSIVE mode with read-only access. In the event of a failure on Node B, CAA fences the shared LUNs on that node (no access – fail all I/Os).]
38
Management Console: WebSMIT vs. IBM Director
CLI & SMIT sysmirror panels still the most common management interfaces

WebSMIT
• Available since HACMP 5.2
• Required web server to run on host until HACMP 5.5 (Gateway server)
• Did not fall in line with look and feel of other IBM offerings

IBM Systems Director Plug-in


• New for PowerHA SystemMirror 7.1
• Only for management of 7.1 & above
• Same look and feel as the IBM suite of products
• Will leverage an existing Director implementation
• Uses the clvt & clmgr CLI behind the covers

39
WebSMIT Gateway Model: One-to-Many (6.1 & Below)

WebSMIT was converted from a one-to-one architecture to one-to-many:

[Diagram: multiple WebSMIT users (User_1–User_4) access multiple clusters (Cluster_A, Cluster_B, Cluster_C) through one standalone WebSMIT server that manages them all.]
40
WebSMIT Screenshot: Associations Tab

41
PowerHA SystemMirror Cluster Management
New GUI User Interface for Version 7.1 Clusters

Three-tier architecture provides scalability:

User Interface
• Web-based interface
• Command-line interface

Director Server
• Central point of control
• Supported on AIX, Linux, and Windows
• Agent manager
• Discovery of clusters and resources

Director Agent
• Automatically installed on AIX 7.1 & AIX 6.1 TL06

[Diagram: the user interface talks to the Director management server, which communicates securely with the Director agent (and PowerHA) on each AIX cluster node.]
42
PowerHA SystemMirror Director Integration

• Accessing the SystemMirror Plug-ins

43
IBM Systems Director: Monitoring Status of Clusters
• Accessing the SystemMirror Plug-ins

44
PowerHA SystemMirror Configuration Wizards
• Wizards

45
PowerHA SystemMirror Smart Assistant Enhancements
Deploy HA Policy for Popular Middleware

46
PowerHA SystemMirror Detailed Views
• SystemMirror Management View

47
IBM Director: Management Dashboard

48
Do you know about clvt & clmgr ?

• “clmgr” announced in PowerHA SM 7.1


– clvt available since HACMP 5.4.1 for Smart Assists
– Hard linked “clmgr” to “clvt”
– Originally clmgr was intended for the Director team & rapidly evolved into a major,
unintended, informal line item.
– allows for deviation from clvt without breaking the Smart Assists

• From this release forward, only “clmgr” is supported for customer use
– clvt is strictly for use by the Smart Assists

• New Command Line Infrastructure


– Ease of Management
• Stop
• Start
• Move Resources

# clmgr on cluster                  -> start cluster services on all nodes

# clmgr sync cluster                -> verify & sync the cluster

# clmgr rg appAgroup node=node2     -> move a resource group
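
A few more illustrative invocations (a sketch; verify the exact verbs and attributes with the clmgr documentation on your release):

# clmgr query cluster            <- display the cluster definition
# clmgr query resource_group     <- list the defined resource groups
# clmgr offline cluster          <- stop cluster services on all nodes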


49
Do you know about “clcmd” in CAA ?
Allows commands to be run across all cluster nodes

# lslpp -w /usr/sbin/clcmd
/usr/sbin/clcmd bos.cluster.rte

# clcmd lssrc -g caa
-------------------------------
NODE mutiny.dfw.ibm.com
-------------------------------
Subsystem    Group   PID        Status
clcomd       caa     9502848    active
cld          caa     10551448   active
clconfd      caa     10092716   active
solid        caa     7143642    active
solidhac     caa     7340248    active
-------------------------------
NODE munited.dfw.ibm.com
-------------------------------
Subsystem    Group   PID        Status
cld          caa     4390916    active
clcomd       caa     4587668    active
clconfd      caa     6357196    active
solidhac     caa     6094862    active
solid        caa     6553698    active

# clcmd lspv
-------------------------------
NODE mutiny.dfw.ibm.com
-------------------------------
hdisk0        0004a99c161a7e45   rootvg          active
caa_private0  0004a99cd90dba78   caavg_private   active
hdisk2        0004a99c3b06bf99   None
hdisk3        0004a99c3b076c86   None
hdisk4        0004a99c3b076ce3   None
hdisk5        0004a99c3b076d2d   None
-------------------------------
NODE munited.dfw.ibm.com
-------------------------------
hdisk0        0004a99c15ecf25d   rootvg          active
caa_private0  0004a99cd90dba78   caavg_private   active
hdisk2        0004a99c3b06bf99   None
hdisk3        0004a99c3b076c86   None
hdisk4        0004a99c3b076ce3   None
hdisk5        0004a99c3b076d2d   None
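
Because clcmd simply fans a command out to every node, it is also handy for quick consistency checks, for example (a sketch):

# clcmd date                 <- confirm the clocks are in sync across nodes
# clcmd oslevel -s           <- confirm all nodes run the same AIX service pack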

50
PowerHA SystemMirror: Sample Application Monitor

# cat /usr/local/hascripts/ora_monitor.sh
#!/bin/ksh
# Exit 0 while the Oracle PMON process is running; a non-zero exit
# tells PowerHA the application has failed.
ps -ef | grep ora_pmon_hatest | grep -v grep > /dev/null 2>&1
51
PowerHA SystemMirror: Pager Events

HACMPpager:
methodname = "Herrera_notify"
desc = "Lab Systems Pager Event"
nodename = "connor kaitlyn"
dialnum = "mherrera@us.ibm.com"
filename = "/usr/es/sbin/cluster/samples/pager/sample.txt"
eventname = "acquire_takeover_addr config_too_long event_error
node_down_complete node_up_complete"
retrycnt = 3
timeout = 45

# cat /usr/es/sbin/cluster/samples/pager/sample.txt
Node %n: Event %e occurred at %d, object = %o

-> Action Taken: Halted Node Connor

Sample Email:

From: root 09/01/2009 Subject: HACMP


Node kaitlyn: Event acquire_takeover_addr occurred at Tue Sep 1 16:29:36 2009, object =

Attention:
Sendmail must be working and accessible via the firewall to receive notifications
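
A quick check that mail actually leaves each cluster node (a sketch; substitute your own address):

# echo "PowerHA pager test from `hostname`" | mail -s "PowerHA pager test" admin@example.com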
52
PowerHA SystemMirror Tunables

• AIX I/O Pacing (High & Low Watermark)


– Typically only enable if recommended after performance evaluation
– Historical values 33 & 24 have been updated to 513 & 256 on AIX 5.3 and 8193 &
4096 on AIX 6.1
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/disk_io_pacing.htm

• Syncd Setting
– Default value of 60; recommended change to 10 (a sketch of checking and setting these tunables follows this list)

• Failure Detection Rate (FDR) – only for Version 6.1 & below
– Normal Settings should suffice in most environments (note that it can be tuned further)
– Remember to enable FFD when using disk heartbeating

• Pre & Post Custom EVENTs


– Entry points for notifications or actions required before phases in a takeover
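
A minimal sketch of checking and setting the tunables above (maxpout/minpout are the standard sys0 I/O pacing attributes; the syncd interval is started from /sbin/rc.boot, so treat that location as something to verify on your release):

# lsattr -El sys0 -a maxpout -a minpout              <- current high / low water marks
# chdev -l sys0 -a maxpout=8193 -a minpout=4096      <- AIX 6.1-style values
# grep syncd /sbin/rc.boot                           <- where the syncd interval (default 60) is set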

53
PowerHA SystemMirror: Testing Best Practices

• Test Application scripts and Application monitors thoroughly


– Common problems include edits to scripts within scripts

• Test fallovers in all directions


– Will confirm start & stop scripts on both locations

• Test Cluster
– Lpars within same frame
– Virtual resources

• Utilize Available Tools – Cluster Test Tool

• When testing upgrades, an "alternate disk install" is your friend (see the sketch below)
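
A minimal sketch of cloning rootvg to an alternate disk before an upgrade (hdisk1 is an example target; confirm it is free first, and that bos.alt_disk_install.rte is installed):

# lspv | grep hdisk1          <- confirm the target disk is not in use
# alt_disk_copy -d hdisk1     <- clone the running rootvg to hdisk1
# bootlist -m normal -o       <- check which disk the system will boot from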

Best Practice:
Testing should be the foundation for your documentation in the event that someone not
PowerHA savvy is there when a failure occurs.

54
How to be successful with PowerHA SystemMirror

• Strict Change Controls


– Available test environment
– Testing of changes

• Leverage C-SPOC functions (see the fastpath below)

– Create / Remove / Change – VGs, LVs, Filesystems
– User Administration
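
The C-SPOC menus are reached through a single SMIT fastpath (cl_admin is the standard entry point; verify it on your release):

# smitty cl_admin             <- C-SPOC: shared LVM, user administration, cluster services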

• Know what to look for (a sketch of following these logs appears after the cltopinfo output below)

– cluster.log / hacmp.out / clstrmgr.debug log files
– /var/hacmp/log/clutils.log -> summary of the nightly verification
– /var/hacmp/clverify/clverify.log -> detailed verification output

munited /# cltopinfo -m
Interface Name Adapter Total Missed Current Missed
Address Heartbeats Heartbeats
--------------------------------------------------------------------------------------------------------------------
en0 192.168.1.103 0 0
rhdisk1 255.255.10.0 1 1

Cluster Services Uptime: 30 days 0 hours 31 minutes
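
During testing, tailing the main event log on the active node is the fastest way to follow a takeover (a sketch; assumes the default /var/hacmp/log location used by 7.1 – older releases wrote hacmp.out to /tmp):

# tail -f /var/hacmp/log/hacmp.out
# grep -i error /var/hacmp/clverify/clverify.log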

55
Summary

• Review your infrastructure for potential single points of failure


– Be aware of the potential pitfalls listed in the common mistakes slide

• Leverage Features like:


– File Collections
– Application monitoring
– Pager Notification Events

• Keep up with feature changes in each release


– New dependencies & fallover behaviors

• Virtualizing P7 or P6 environments is the foundation for Live Partition Mobility


– NPIV capable adapters can help simplify the configuration & management

• WebSMIT & IBM Director are the available GUI front-ends


– The cluster release will determine which one to use

56
Learn More About PowerHA SystemMirror

PowerHA SystemMirror IBM Portal

Popular Topics:

* Frequently Asked Questions

* Customer References

* Documentation

* White Papers

http://www-03.ibm.com/systems/power/software/availability/aix/index.html
( … or Google ‘PowerHA SystemMirror’ and click I’m Feeling Lucky)

57
Questions?

Thank you for your time!


58
Additional Resources

• New - Disaster Recovery Redbook


SG24-7841 - Exploiting PowerHA SystemMirror Enterprise Edition for AIX
http://www.redbooks.ibm.com/abstracts/sg247841.html?Open

• New - RedGuide: High Availability and Disaster Recovery Planning: Next-Generation


Solutions for Multi server IBM Power Systems Environments
http://www.redbooks.ibm.com/abstracts/redp4669.html?Open

• Online Documentation
http://www-03.ibm.com/systems/p/library/hacmp_docs.html

• PowerHA SystemMirror Marketing Page


http://www-03.ibm.com/systems/p/ha/

• PowerHA SystemMirror Wiki Page


http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/High+Availability

• PowerHA SystemMirror (“HACMP”) Redbooks


http://www.redbooks.ibm.com/cgi-bin/searchsite.cgi?query=hacmp

59
