
Designing A PowerHA SystemMirror For AIX High Availability Solution - HA17 - Herrera


IBM Power Systems Technical University

October 18–22, 2010 — Las Vegas, NV

Session Title:
Designing a PowerHA SystemMirror for AIX High Availability Solution

Session ID: HA17(AIX)

Speaker Name: Michael Herrera

© 2010 IBM Corporation


Best Practices for Designing a PowerHA
SystemMirror for AIX High Availability Solution
Michael Herrera (mherrera@us.ibm.com)
Advanced Technical Skills (ATS)
Certified IT Specialist

Agenda

• Common Misconceptions & Mistakes

• Infrastructure Considerations

• Differences in 7.1

• Virtualization & PowerHA SystemMirror

• Licensing Scenarios

• Cluster Management & Testing

• Summary

3
HACMP is now PowerHA SystemMirror for AIX!
HA & DR solutions from IBM for your mission-critical AIX applications

• Current Release: 7.1.0.X


– Available on: AIX 6.1 TL06 & 7.1

• Packaging Changes:
– Standard Edition - Local Availability
– Enterprise Edition - Local & Disaster Recovery

• Licensing Changes:
– Small, Medium, Large Server Class

Product Lifecycle:
Version Release Date End of Support Date
HACMP 5.4.1 Nov 6, 2007 Sept, 2011
PowerHA 5.5.0 Nov 14, 2008 N/A
PowerHA SystemMirror 6.1.0 Oct 20, 2009 N/A
PowerHA SystemMirror 7.1.0 Sept 10, 2010 N/A
* These dates are subject to change per Announcement Flash
4
PowerHA SystemMirror Minimum Requirements

PowerHA SystemMirror 7.1 (7.1.0.1 – Sep)
• AIX 7.1
• AIX 6.1 TL6 SP1

PowerHA SystemMirror 6.1 (6.1.0.2 – May 21)
• AIX 7.1
• AIX 6.1 TL2 with RSCT 2.5.4.0
• AIX 5.3 TL9 with RSCT 2.4.12.0

PowerHA Version 5.5 (5.5.0.6 – June 7)
• AIX 7.1
• AIX 6.1 TL2 SP1 with APAR IZ31208 and RSCT 2.5.2.0
  (Async GLVM: APARs IZ31205 and IZ31207)
• AIX 5L 5.3 TL9 with RSCT 2.4.10.0

HACMP 5.4.1 (5.4.1.8 – May 13)
• AIX 6.1 with RSCT 2.5.0.0 or higher
• AIX 5.3 TL4 with RSCT 2.4.5 (IY84920) or higher
• AIX 5.2 TL8 with RSCT 2.3.9 (IY84921) or higher
5
Common Misconceptions

• PowerHA SystemMirror is an out-of-the-box solution

– Application start / stop scripts must be written and tested by you (a minimal sketch follows at the end of this slide)
– Application monitors will also require scripting & testing

• PowerHA SystemMirror is installed, so we are completely protected

– Consider all single points of failure – e.g. SAN, LAN, I/O drawers, etc.

• Heartbeats go over a dedicated link

– All interfaces defined to the cluster will pass heartbeats (IP & non-IP)
– CAA changes this behavior significantly in 7.1

• With clustering I need two of everything – hence idle resources

Fact:
Clustering will highlight what you are & are NOT doing right in your environment
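
The application start and stop scripts are pieces you write and test yourself. A minimal sketch of the shape they usually take (the paths, user, and application commands below are hypothetical examples, not part of the product):

# cat /usr/local/hascripts/app_start.sh
#!/bin/ksh
# Start the application as its owning user and log the output for later review.
su - appuser -c "/opt/app/bin/start_app" >> /tmp/app_start.log 2>&1
exit 0

# cat /usr/local/hascripts/app_stop.sh
#!/bin/ksh
# Stop the application cleanly so the volume groups can be varied off during a fallover.
su - appuser -c "/opt/app/bin/stop_app" >> /tmp/app_stop.log 2>&1
exit 0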

6
Common Mistakes Beyond Base Cluster Functionality

• Down / missing serial networks
• EtherChannel links down
• Errors in verification
• Inconsistent AIX levels
• Down-level cluster filesets
• Missing filesets
• Missing custom disk methods
• SAN not built in a robust fashion
• Bootlist issues
• Dump devices: insufficient size, mirrored, no secondary device
• I/O pacing enabled (old values)
• HBA levels at GA code
• Fibre Channel tunable settings not enabled
• Interim fixes not loaded on all cluster nodes

Lack of education / experience:
• Not knowing expected fallover behaviors
• Lack of application monitoring
• Fallback policy not set to desired behavior
• Not knowing what to monitor or check (CLI, logs)

Poor change controls:
• Not propagating changes appropriately
• No change history

Solutions:
• IBM Training / Redbooks / Proofs of Concept / ATS Health-check Reviews
7
Identify & Eliminate Points of Failure

• LAN Infrastructure
– Redundant Switches

• SAN Infrastructure
– Redundant Fabric

• Application Availability
– Application Monitoring
– Availability Reports

8
Infrastructure Considerations

[Diagram: Site A and Site B, with Node A and Node B sharing replicated 50 GB LUNs (SITEAMETROVG). The LAN, SAN, and DWDM links between the sites all run through one pipe.]
Important:
Identify & Eliminate Single Points of Failure!
9
Infrastructure Considerations

[Diagram: the same two sites, now with redundancy designed in: an XD_ip network over the WAN, an XD_rs232 serial link, and net_ether_0 across the LANs; two disk heartbeat networks (diskhb_vg1 and diskhb_vg2), each a 1 GB ECM VG on a LUN shared by Node A and Node B; and the replicated 50 GB SITEAMETROVG LUNs across the SAN / DWDM links.]

Important:
Identify single points of failure & design the solution around them
10
Infrastructure Considerations

• Power redundancy
• I/O drawers
• SCSI backplane
• SAN HBAs
• Virtualized environments
• Application fallover protection

Real customer scenarios:
1. Two cluster nodes sharing the same I/O drawer
2. Application failure with no monitoring – the box remains up, so no cluster fallover occurs

Moral of the story:
* High availability goes beyond just installing the cluster software

11
PowerHA SystemMirror 7.1: Topology management
Heartbeating differences between earlier cluster releases

[Diagram: a four-LPAR cluster. In PowerHA SM 6.1 and earlier, RSCT heartbeat rings run over IP subnets plus point-to-point disk heartbeat networks (diskhb_net1–diskhb_net4). In PowerHA SM 7.1 with CAA, the same LPARs communicate via multicast.]

PowerHA SM 6.1 & earlier – RSCT-based heartbeating:
• Leader, Successor, Mayor, etc.
• Strict subnet rules
• No heartbeating over HBAs
• Multiple disk heartbeat networks: point-to-point only; each network requires a LUN with an ECM VG

PowerHA SM 7.1 with CAA – kernel-based cluster message handling:
• Multicast-based protocol
• Uses network & SAN as needed
• Discovers and uses as many adapters as possible
• All monitors are implemented at low levels of the AIX kernel and are largely insensitive to system load
• Single repository disk – used to heartbeat & store information

12
Transition of PowerHA Topology IP Networks

Network: net_ether_0

[Diagram: in 6.1 and below, each node has two interfaces (en0, en1) with base addresses on separate non-routable subnets (192.168.100.x and 192.168.101.x), a persistent IP (9.19.51.10 / 9.19.51.11), and service IPs (9.19.51.20 / 9.19.51.21); heartbeat rings run over the VLAN.]

Traditional heartbeating rules no longer apply. However, route striping is still a potential issue: when
two interfaces have routable IPs on the same subnet, AIX will send half the traffic out of each interface.

Methods to circumvent this:

• Link Aggregation / EtherChannel (a quick check follows below)
• Virtualized interfaces with dual VIO servers
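
A quick way to confirm how the aggregation is configured on each node (a sketch; ent0/ent1/ent2 are example adapter names, and EtherChannel devices are normally created through the smitty etherchannel menus):

# smitty etherchannel                                    <- create / change the EtherChannel device
# lsattr -El ent2 -a adapter_names -a backup_adapter -a mode
# entstat -d ent2                                        <- per-adapter statistics for the aggregation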

[Diagram: with link aggregation, each node presents a single interface (en2) backed by two physical adapters (ent0, ent1); the base address (9.19.51.10 / 9.19.51.11) and the service IPs (9.19.51.20 / 9.19.51.21) all sit on the routable subnet.]
13
PowerHA SM 7.1: Additional Heartbeating Differences
Heartbeating:
• Self Tuning Failure Detection Rate (FDR)
• All interfaces are used even if not in cluster networks

[Diagram: every interface on both nodes is monitored, including en3, which is not part of any cluster network; en2 carries the base and service addresses over the aggregated ent0/ent1 adapters.]

Serial Networks removed:


• No more rs232 support
• No more traditional disk heartbeating over ECM VG
• No more slow takeover w/disk heartbeat device as last device on selective takeover

Critical Volume Groups

• Replace Multi-Node Disk Heartbeating (MNDHB)
– e.g. the Oracle RAC three-disk volume group holding the voting files
– Unlike MNDHB, no longer for general use
– Migration is a manual operation, and a customer responsibility
– Any concurrent-access volume group can be marked as "Critical"

14
CAA – Cluster Aware AIX
Enabling tighter Integration with PowerHA SystemMirror

What is it:
• A set of services/tools embedded in AIX to help manage a cluster of AIX
nodes and/or help run cluster software on AIX
• IBM cluster products (including RSCT, PowerHA, and the VIOS) will use
and/or call CAA services/tools
• CAA services can assist in the management and monitoring of an arbitrary
set of nodes and/or running a third-party cluster

• CAA does not form a cluster by itself. It is a tool set.


• There is no notion of quorum (If 20 nodes of a 21 node cluster are down,
CAA still runs on the remaining node)
• CAA does not eject nodes from a cluster. CAA provides tools to fence a
node but never fences a node and will continue to run on a fenced node

Major Benefits:
• Enhanced Health Management (Integrated Health Monitoring)
• Cluster Wide Device Naming
15
Cluster Aware AIX Exploiters

[Diagram: RSCT consumers – IBM Director, DB2, PowerHA SystemMirror, TSA, HMC, HPC, VIOS, and storage products – sit on top of RSCT. Legacy RSCT bundles its own cluster messaging, monitoring, and configuration repository beneath Group Services and the resource manager services; with Cluster Aware AIX, those redesigned cluster layers are integrated into CAA, which exposes cluster repository, monitoring, messaging, and event capabilities through CAA APIs and UIs in AIX.]

• RSCT and Cluster Aware AIX together provide the foundation of strategic Power Systems SW

• RSCT-CAA integration enables compatibility with a diverse set of dependent IBM products

• RSCT integration with CAA extends simplified cluster management along with optimized and robust
cluster monitoring, failure detection, and recovery to RSCT exploiters on Power / AIX
16
Cluster Aware AIX: Central Repository Disk
Contrast from previous releases

Aids in:
• Global device naming
• Inter-node synchronization
• Centrally managed configuration
• Heartbeating device

Direction:
• In the first release, support is confined to shared storage
• Will eventually evolve into a general AIX device rename interface
• Future direction is to enable cluster-wide storage policy settings
• The PowerHA ODM will eventually also move entirely to the repository disk

PowerHA SystemMirror 6.1 & prior:
[Diagram: each host (Host 1–3) keeps its own copy of the HA ODM, kept consistent through cluster synchronization.]

PowerHA SystemMirror 7.1 & CAA:
[Diagram: Hosts 1–3 all read and write a single central repository disk.]

PowerHA SystemMirror will continue to run if the central repository disk goes away;
however, no changes may take place within the cluster.

17
Multi Channel Health Management – Out of the Box
Hardened Environments with new communication protocol

[Diagram: LPAR 1 and LPAR 2 exchange reliable heartbeats and messaging over multiple channels for faster detection and more efficient communication.]

• First line of defense: network
• Second line of defense: SAN
• Third line of defense: repository disk

Highlights:
• RSCT Topology services no longer used for cluster Heartbeating
• All customers now have multiple communication paths by default
18
Basic Cluster vs. Advanced Cluster Features

Basic cluster:
– Network topology
– Resource group(s): IPs, VGs, application server
– Application monitoring
– Pager events

[Diagram: a single resource group (IP, VGs, App Server) served over one IP network and one SAN network.]

Advanced cluster:
– Multiple networks: crossover connections, virtualized resources
– Multiple resource groups: mutual takeover, custom resource groups, adaptive fallover
– NFS cross-mounts
– File collections
– Dependencies: parent / child, location, start after, stop after
– Smart Assists
– Multiple sites: cross-site LVM configurations, storage replication, IP replication
– Application monitoring
– Pager events
– DLPAR integration: grow the LPAR on fallover
– Director management
– WebSMIT management
– Dynamic node priority

[Diagram: a two-site cluster (Site A and Site B) with multiple IP networks, three resource groups (App1, App2, Dev App), a SAN network, and disk replication between the sites.]
19
PowerHA SystemMirror: Fallover Possibilities
Cluster Scalable to 32 nodes

• One-to-one
• One-to-any
• Any-to-one
• Any-to-any

20


Methods to Circumvent Unused Resources

Mutual takeover:
– Resource Group A (Node A, Node B): shared IP, VG/s & filesystems, App 1
– Resource Group B (Node A, Node B): shared IP, VG/s & filesystems, App 2
– Resource Group C (Node B, Node A): shared IP, VG/s & filesystems, App 3
– Resource Group D (Node B, Node A): shared IP, VG/s & filesystems, App 4
(RG dependencies keep each pair together; each node hosts two groups and can take over the other node's groups)

Virtualization:
[Diagram: Node A on Frame 1 and Node B on Frame 2 are VIO clients; each frame's dual VIO servers provide virtual NICs and HBAs to the SAN, where the rootvg disks and the shared LUNs (e.g. oracle_vg1) reside on the storage subsystem.]

21
Power Virtualization & PowerHA SystemMirror

[Diagram: a PowerHA cluster of two client LPARs (PowerHA_node 1 and PowerHA_node 2) on separate frames; each node has a virtual Ethernet adapter (en0) and virtual fibre adapters (vfc0/vfc1) served by dual VIO servers, connected to the shared LAN, the SAN, and an external storage enclosure holding the rootvg and data volumes.]

• LPAR / DLPAR
• Micro-partitioning & shared processor pools
• Virtual I/O Server: virtual Ethernet, virtual SCSI, virtual fibre
• Live Partition Mobility
• Active Memory Sharing
• WPAR (AIX 6.1)

22
PowerHA SystemMirror Virtualization Considerations

• Ethernet Virtualization
– Topology should look the same as environment using link aggregation
– Version 7.1 no longer uses netmon.cf file
– As a best practice dual VIO Servers are recommended
• SEA Fallover Backend

• Storage Virtualization
– Both methods of virtualizing storage are supported
• VSCSI vs. Virtual fiber (NPIV)
– In DR implementations leveraging disk replication consider the
implications of using either option

• Benefits of virtualization:
– Maximize utilization of resources
– Fewer PCI slots & physical adapters
– Foundation for advanced functions like Live Partition Mobility
– Migrations to newer Power Hardware are simplified

* Live Partition Mobility & PowerHA SM complement each other: LPM covers planned (non-reactive) maintenance,
while PowerHA covers unplanned (reactive) outages. See Chapter 2.4, PowerVM Virtualization Considerations.
23
Virtual Ethernet & PowerHA SystemMirror
No Link Aggregation / Same Frame

[Diagram: SEA fallover within a single frame. Each VIO server bridges its physical adapter (ent0) to a virtual trunk adapter on PVID 10 through a Shared Ethernet Adapter (ent4/en6), with a control channel between the VIO servers on PVID 99; the PowerHA LPARs each see a single virtual interface (en0).]
This is a diagram of the configuration required for SEA fallover across VIO Servers. Note
that Ethernet traffic will not be load balanced across the VIO Servers. The lower trunk
priority on the “ent2” virtual adapter would designate the primary VIO Server to use.
24
Virtual Ethernet & PowerHA SystemMirror
Independent Frames & Link Aggregation

[Diagram: the same SEA fallover design on two independent frames. On each frame the VIO servers back the SEA with a link aggregation (ent3) over two physical adapters, plus a control channel, and the PowerHA LPAR on that frame (LPAR 1 on Frame 1, LPAR 2 on Frame 2) sees a single virtual interface (en0).]
25
PowerHA SystemMirror 6.1 & Below

[Diagram: with PowerHA 6.1 and below, each node presents one virtual interface (en0) carrying a base address (9.19.51.10 / 9.19.51.11) and a service IP (9.19.51.20 / 9.19.51.21) on net_ether_0; Topology Services heartbeats over the IP network and over a serial network (serial_net_0). Underneath, each frame uses dual VIO servers with SEA fallover and link aggregation.]

* The netmon.cf file is used for single-adapter networks
26


PowerHA SystemMirror 7.1 - Topology

All nodes are monitored:


Cluster Aware AIX tells you what nodes are
in the cluster and information on those
nodes including state. A special “gossip”
protocol is used over the multicast address
to determine node information and
implement scalable reliable multicast. No
traditional heartbeat mechanism is
employed. Gossip packets travel over all
interfaces including storage.

Differences:
• RSCT Topology services is no longer used for heartbeat monitoring
• Subnet Requirements no longer need to be followed
• Netmon.cf file is no longer required or used
• All interfaces are used for monitoring even if they are not in an HA network
(this may be tunable in a future release)
• IGMP snooping must be enabled on the switches

27
VSCSI Mapping vs. NPIV (virtual fiber)

[Diagram: two frames, each hosting a PowerHA node behind dual VIO servers. The rootvg and a vscsi_vg are presented through VSCSI (vhost/vscsi adapter pairs, with MPIO handled in the VIO servers), while an npiv_vg is presented through NPIV virtual fibre adapters (fcs0/fcs1) mapped straight to the client; all LUNs come from the shared storage subsystem.]
28
Live Partition Mobility Support with IBM PowerHA
How does it all work?

[Diagram: PowerHA Node 1 and Node 2 run on separate frames, fully virtualized through dual VIO servers, with rootvg and datavg on the SAN; Live Partition Mobility can move a node between the frames.]

Considerations:
• This is a planned move
• It assumes that all resources are virtualized through VIO (storage & Ethernet connections)
• PowerHA should only experience a minor disruption to the heartbeats during a move
• IVE / HEA virtual Ethernet is not supported for LPM
• VSCSI & NPIV virtual fibre mappings are supported

The two solutions complement each other by providing the ability to perform non-disruptive
maintenance while retaining the ability to fall over in the event of a system or application outage.

29
PowerHA and LPM Feature Comparison

Feature                                                PowerHA SystemMirror   Live Partition Mobility
Live OS/App move between physical frames*                       –                     Yes
Server workload management**                                    –                     Yes
Energy management**                                             –                     Yes
Hardware maintenance                                           Yes                    Yes
Software maintenance                                           Yes                     –
Automated failover upon system failure (OS or HW)              Yes                     –
Automated failover upon HW failure                             Yes                     –
Automated failover upon application failure                    Yes                     –
Automated failover upon VG access loss                         Yes                     –
Automated failover upon any specified AIX error                Yes                     –
  (via customized error notification of an error report entry)

* ~2 seconds of total interruption time

** Requires free system resources on the target system
30
PowerHA SystemMirror: DLPAR Value Proposition

Pros:
• Automated action on acquisition of resources (bound to the PowerHA application server)
• HMC verification checking for connectivity to the HMC
• Ability to grow the LPAR on failover
• Save $ on PowerHA SM licensing – thin standby node

Cons:
• Requires connectivity to the HMC
• Potentially slower failover
• Lacks the ability to grow the LPAR on the fly

System specs: 32-way (2.3 GHz) Squad-H+ with 256 GB of memory
Results:
– a 120 GB DLPAR add took 1 min 55 sec
– a 246 GB DLPAR add took 4 min 25 sec
– at 30% busy, running an artificial load, the add took 4 min 36 sec

[Diagram: LPAR A hosts the application server with a minimal CPU count plus DLPAR CPUs; LPAR B is the backup with a minimal CPU count; both communicate with the HMCs over ssh.]

31
DLPAR Licensing Scenario
How does it all work?

[Diagram: two Power7 740 16-way servers, each licensed for 10 CPUs of capacity.
System A hosts Oracle DB (1 CPU + 1 acquired via DLPAR with the app), Banner DB (1 CPU + 2 via DLPAR), the standby LPARs for PeopleSoft and Financial DB (1 CPU each), and a Print Server (2 CPU).
System B hosts the standby LPARs for Oracle DB and Banner DB (1 CPU each), PeopleSoft (1 CPU + 1 via DLPAR), Financial DB (1 CPU + 2 via DLPAR), and TSM (2 CPU).]

Applications                      CPU   Memory
Production Oracle DB               2    16 GB
Production PeopleSoft              2     8 GB
AIX Print Server                   2     4 GB
Banner Financial DB                3    32 GB
Production Financial DB            3    32 GB
Tivoli Storage Manager 5.5.2.0     2     8 GB
32
Environment: PowerHA App Server Definitions

The actual application requirements are stored in the PowerHA SystemMirror application server
definitions (e.g. Oracle DB: Min 1 / Desired 2 CPUs; Banner DB: Min 1 / Desired 3 CPUs) and are
enforced during the acquisition or release of application server resources.

During acquisition of resources at cluster start-up, the host will ssh to the pre-defined HMC(s)
to perform the DLPAR operation automatically (the sketch below shows the equivalent hand-run HMC command).

[Diagram: the same System A / System B layout as the previous slide; when each resource group starts,
the hosting node acquires its additional CPUs (+1 for Oracle DB and PeopleSoft, +2 for Banner DB and
Financial DB) through the HMC via DLPAR.]
33
Environment: DLPAR Resource Processing Flow

1. Activate LPARs – the LPAR profiles define Min / Desired / Max CPU values (e.g. Min 1 / Desired 1 / Max 2 for the Oracle DB LPARs and Min 1 / Desired 1 / Max 3 for the Banner DB LPARs).
2. Start PowerHA – the application server requirements (Min 1 / Desired 2 and Min 1 / Desired 3) are read, and the node hosting each resource group acquires the additional CPUs from the HMC via DLPAR (Oracle DB grows from 1 to 2 CPUs, Banner DB from 1 to 3).
3. Release resources – on a fallover or rg_move, the DLPAR CPUs are released on the source node and acquired on the target node.
4. Release resources – stopping the cluster without takeover releases the DLPAR-acquired CPUs.

Take-aways:
• CPU allocations follow the application server wherever it is hosted (this model allows you to lower the HA license count)
• DLPAR resources only get processed during the acquisition or release of cluster resources
• PowerHA 6.1+ provides micro-partitioning support and the ability to also alter virtual processor counts
• DLPAR resources can come from free CPUs in the shared processor pool or from CoD resources

34
PowerHA SystemMirror: DLPAR Value Proposition

Environment using the dedicated CPU model (no DLPAR):
[Diagram: Oracle DB 2 CPU / standby 2 CPU, Banner DB 3 CPU / standby 3 CPU, PeopleSoft 2 CPU / standby 2 CPU, Financial DB 3 CPU / standby 3 CPU, spread across System A and System B.]
PowerHA license counts: Cluster 1: 4 CPUs, Cluster 2: 6 CPUs, Cluster 3: 4 CPUs, Cluster 4: 6 CPUs – Total: 20 licenses

Environment using the DLPAR model:
[Diagram: each production LPAR runs with 1 CPU plus CPUs acquired via DLPAR with the application (+1 for Oracle DB and PeopleSoft, +2 for Banner DB and Financial DB); each standby LPAR has 1 CPU.]
PowerHA license counts: Cluster 1: 3 CPUs, Cluster 2: 4 CPUs, Cluster 3: 3 CPUs, Cluster 4: 4 CPUs – Total: 14 licenses
35
PowerHA SystemMirror: DLPAR Modified Model
Environment using the DLPAR model (same as the previous slide):
PowerHA license counts: Cluster 1: 3 CPUs, Cluster 2: 4 CPUs, Cluster 3: 3 CPUs, Cluster 4: 4 CPUs – Total: 14 licenses

Environment using the modified DLPAR model:
[Diagram: both production LPARs on each system are consolidated into one LPAR, with control separated by resource groups. System A hosts Oracle DB and Banner DB in one LPAR (1 CPU + 4 acquired via DLPAR with the apps) with a 1 CPU standby on System B; System B hosts PeopleSoft and Financial DB in one LPAR (1 CPU + 4 via DLPAR) with a 1 CPU standby on System A.]
PowerHA license counts: Cluster 1: 6 CPUs, Cluster 2: 6 CPUs – Total: 12 licenses

36
Data Protection with PowerHA SM 7.1 & CAA
Enhanced Concurrent Mode Volume Groups are now required

ECM VGs were introduced in version 5.1


• Fast Disk Takeover
• Fast Failure Detection
• Disk heartbeating

Disk Fencing in CAA


• Fencing is automatic and transparent
• Cannot be turned off
• Fence group created by cl_vg_fence_init called from node_up

CAA Storage Framework fencing support


• Ability to specify level of disk access allowed by device driver
– Read/Write
– Read Only
– No Access (I/O is held until timeout)
– Fast Failure

37
Data Protection with PowerHA SM 7.1 & CAA
ECM Volume groups and the newly added protection

• LVM enhanced concurrent mode VGs (passive mode):
– Prevent writes to logical volume or volume group devices
– Prevent filesystems from being mounted, or any change requiring access to them

• CAA fencing – prevents writes to the disk itself (e.g. dd, which runs below the LVM level)

[Diagram: Node A holds datavg (an ECM VG containing /data, /data/app, /data/db) in ACTIVE mode with read/write access; Node B holds it in PASSIVE mode with read-only access. In the event of a failure on Node B, CAA fences the shared LUNs on that node (no access – fail all I/Os).]
38
Management Console: WebSMIT vs. IBM Director
CLI & SMIT sysmirror panels still the most common management interfaces

WebSMIT
• Available since HACMP 5.2
• Required web server to run on host until HACMP 5.5 (Gateway server)
• Did not fall in line with look and feel of other IBM offerings

IBM Systems Director Plug-in


• New for PowerHA SystemMirror 7.1
• Only for management of 7.1 & above
• Same look and feel as the IBM suite of products
• Will leverage an existing Director implementation
• Uses the clvt & clmgr CLI behind the covers

39
WebSMIT Gateway Model: One-to-Many (6.1 & Below)

WebSMIT was converted from a one-to-one architecture to one-to-many:

[Diagram: multiple WebSMIT users (User_1–User_4) access multiple clusters (Cluster_A, Cluster_B, Cluster_C) through one standalone WebSMIT server that manages them all.]
40
WebSMIT Screenshot: Associations Tab

41
PowerHA SystemMirror Cluster Management
New GUI User Interface for Version 7.1 Clusters

Three-tier architecture provides scalability:

User Interface
• Web-based interface
• Command-line interface

Director Server
• Central point of control
• Supported on AIX, Linux, and Windows
• Agent manager
• Discovery of clusters and resources

Director Agent
• Automatically installed on AIX 7.1 & AIX 6.1 TL06

[Diagram: the user interface talks to the Director management server, which communicates securely with the Director agent (and PowerHA) on each AIX cluster node.]
42
PowerHA SystemMirror Director Integration

• Accessing the SystemMirror Plug-ins

43
IBM Systems Director: Monitoring Status of Clusters
• Accessing the SystemMirror Plug-ins

44
PowerHA SystemMirror Configuration Wizards
• Wizards

45
PowerHA SystemMirror Smart Assistant Enhancements
Deploy HA Policy for Popular Middleware

46
PowerHA SystemMirror Detailed Views
• SystemMirror Management View

47
IBM Director: Management Dashboard

48
Do you know about clvt & clmgr ?

• “clmgr” announced in PowerHA SM 7.1


– clvt available since HACMP 5.4.1 for Smart Assists
– Hard linked “clmgr” to “clvt”
– Originally clmgr was intended for the Director team & rapidly evolved into a major,
unintended, informal line item.
– allows for deviation from clvt without breaking the Smart Assists

• From this release forward, only “clmgr” is supported for customer use
– clvt is strictly for use by the Smart Assists

• New Command Line Infrastructure


– Ease of Management
• Stop
• Start
• Move Resources

# clmgr on cluster                  -> start cluster services on all nodes

# clmgr sync cluster                -> verify & sync the cluster

# clmgr rg appAgroup node=node2     -> move a resource group
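
A few more illustrative invocations (a sketch; verify the exact verbs and attributes with the clmgr documentation on your release):

# clmgr query cluster            <- display the cluster definition
# clmgr query resource_group     <- list the defined resource groups
# clmgr offline cluster          <- stop cluster services on all nodes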


49
Do you know about “clcmd” in CAA ?
Allows commands to be run across all cluster nodes

# lslpp -w /usr/sbin/clcmd
/usr/sbin/clcmd bos.cluster.rte

# clcmd lssrc -g caa
-------------------------------
NODE mutiny.dfw.ibm.com
-------------------------------
Subsystem    Group   PID        Status
clcomd       caa     9502848    active
cld          caa     10551448   active
clconfd      caa     10092716   active
solid        caa     7143642    active
solidhac     caa     7340248    active
-------------------------------
NODE munited.dfw.ibm.com
-------------------------------
Subsystem    Group   PID        Status
cld          caa     4390916    active
clcomd       caa     4587668    active
clconfd      caa     6357196    active
solidhac     caa     6094862    active
solid        caa     6553698    active

# clcmd lspv
-------------------------------
NODE mutiny.dfw.ibm.com
-------------------------------
hdisk0        0004a99c161a7e45   rootvg          active
caa_private0  0004a99cd90dba78   caavg_private   active
hdisk2        0004a99c3b06bf99   None
hdisk3        0004a99c3b076c86   None
hdisk4        0004a99c3b076ce3   None
hdisk5        0004a99c3b076d2d   None
-------------------------------
NODE munited.dfw.ibm.com
-------------------------------
hdisk0        0004a99c15ecf25d   rootvg          active
caa_private0  0004a99cd90dba78   caavg_private   active
hdisk2        0004a99c3b06bf99   None
hdisk3        0004a99c3b076c86   None
hdisk4        0004a99c3b076ce3   None
hdisk5        0004a99c3b076d2d   None
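
Because clcmd simply fans a command out to every node, it is also handy for quick consistency checks, for example (a sketch):

# clcmd date                 <- confirm the clocks are in sync across nodes
# clcmd oslevel -s           <- confirm all nodes run the same AIX service pack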

50
PowerHA SystemMirror: Sample Application Monitor

# cat /usr/local/hascripts/ora_monitor.sh
#!/bin/ksh
# Exit 0 while the Oracle PMON process is running; a non-zero exit
# tells PowerHA the application has failed.
ps -ef | grep ora_pmon_hatest | grep -v grep > /dev/null 2>&1
51
PowerHA SystemMirror: Pager Events

HACMPpager:
methodname = "Herrera_notify"
desc = "Lab Systems Pager Event"
nodename = "connor kaitlyn"
dialnum = "mherrera@us.ibm.com"
filename = "/usr/es/sbin/cluster/samples/pager/sample.txt"
eventname = "acquire_takeover_addr config_too_long event_error
node_down_complete node_up_complete"
retrycnt = 3
timeout = 45

# cat /usr/es/sbin/cluster/samples/pager/sample.txt
Node %n: Event %e occurred at %d, object = %o

-> Action Taken: Halted Node Connor

Sample Email:

From: root 09/01/2009 Subject: HACMP


Node kaitlyn: Event acquire_takeover_addr occurred at Tue Sep 1 16:29:36 2009, object =

Attention:
Sendmail must be working and accessible via the firewall to receive notifications
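
A quick check that mail actually leaves each cluster node (a sketch; substitute your own address):

# echo "PowerHA pager test from `hostname`" | mail -s "PowerHA pager test" admin@example.com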
52
PowerHA SystemMirror Tunables

• AIX I/O Pacing (High & Low Watermark)


– Typically only enable if recommended after performance evaluation
– Historical values 33 & 24 have been updated to 513 & 256 on AIX 5.3 and 8193 &
4096 on AIX 6.1
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/disk_io_pacing.htm

• Syncd Setting
– Default value of 60; recommended change to 10 (a sketch of checking and setting these tunables follows this list)

• Failure Detection Rate (FDR) – only for Version 6.1 & below
– Normal Settings should suffice in most environments (note that it can be tuned further)
– Remember to enable FFD when using disk heartbeating

• Pre & Post Custom EVENTs


– Entry points for notifications or actions required before phases in a takeover
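
A minimal sketch of checking and setting the tunables above (maxpout/minpout are the standard sys0 I/O pacing attributes; the syncd interval is started from /sbin/rc.boot, so treat that location as something to verify on your release):

# lsattr -El sys0 -a maxpout -a minpout              <- current high / low water marks
# chdev -l sys0 -a maxpout=8193 -a minpout=4096      <- AIX 6.1-style values
# grep syncd /sbin/rc.boot                           <- where the syncd interval (default 60) is set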

53
PowerHA SystemMirror: Testing Best Practices

• Test Application scripts and Application monitors thoroughly


– Common problems include edits to scripts within scripts

• Test fallovers in all directions


– Will confirm start & stop scripts on both locations

• Test Cluster
– Lpars within same frame
– Virtual resources

• Utilize Available Tools – Cluster Test Tool

• When testing upgrades, an "alternate disk install" is your friend (see the sketch below)
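
A minimal sketch of cloning rootvg to an alternate disk before an upgrade (hdisk1 is an example target; confirm it is free first, and that bos.alt_disk_install.rte is installed):

# lspv | grep hdisk1          <- confirm the target disk is not in use
# alt_disk_copy -d hdisk1     <- clone the running rootvg to hdisk1
# bootlist -m normal -o       <- check which disk the system will boot from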

Best Practice:
Testing should be the foundation for your documentation in the event that someone not
PowerHA savvy is there when a failure occurs.

54
How to be successful with PowerHA SystemMirror

• Strict Change Controls


– Available test environment
– Testing of changes

• Leverage C-SPOC functions (see the fastpath below)

– Create / Remove / Change – VGs, LVs, Filesystems
– User Administration
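
The C-SPOC menus are reached through a single SMIT fastpath (cl_admin is the standard entry point; verify it on your release):

# smitty cl_admin             <- C-SPOC: shared LVM, user administration, cluster services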

• Know what to look for (a sketch of following these logs appears after the cltopinfo output below)

– cluster.log / hacmp.out / clstrmgr.debug log files
– /var/hacmp/log/clutils.log -> summary of the nightly verification
– /var/hacmp/clverify/clverify.log -> detailed verification output

munited /# cltopinfo -m
Interface Name Adapter Total Missed Current Missed
Address Heartbeats Heartbeats
--------------------------------------------------------------------------------------------------------------------
en0 192.168.1.103 0 0
rhdisk1 255.255.10.0 1 1

Cluster Services Uptime: 30 days 0 hours 31 minutes
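
During testing, tailing the main event log on the active node is the fastest way to follow a takeover (a sketch; assumes the default /var/hacmp/log location used by 7.1 – older releases wrote hacmp.out to /tmp):

# tail -f /var/hacmp/log/hacmp.out
# grep -i error /var/hacmp/clverify/clverify.log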

55
Summary

• Review your infrastructure for potential single points of failure


– Be aware of the potential pitfalls listed in the common mistakes slide

• Leverage Features like:


– File Collections
– Application monitoring
– Pager Notification Events

• Keep up with feature changes in each release


– New dependencies & fallover behaviors

• Virtualizing P7 or P6 environments is the foundation for Live Partition Mobility


– NPIV capable adapters can help simplify the configuration & management

• WebSMIT & IBM Director are the available GUI front-ends


– The cluster release will determine which one to use

56
Learn More About PowerHA SystemMirror

PowerHA SystemMirror IBM Portal

Popular Topics:

* Frequently Asked Questions

* Customer References

* Documentation

* White Papers

http://www-03.ibm.com/systems/power/software/availability/aix/index.html
( … or Google ‘PowerHA SystemMirror’ and click I’m Feeling Lucky)

57
Questions?

Thank you for your time!


58
Additional Resources

• New - Disaster Recovery Redbook


SG24-7841 - Exploiting PowerHA SystemMirror Enterprise Edition for AIX
http://www.redbooks.ibm.com/abstracts/sg247841.html?Open

• New - RedGuide: High Availability and Disaster Recovery Planning: Next-Generation


Solutions for Multi server IBM Power Systems Environments
http://www.redbooks.ibm.com/abstracts/redp4669.html?Open

• Online Documentation
http://www-03.ibm.com/systems/p/library/hacmp_docs.html

• PowerHA SystemMirror Marketing Page


http://www-03.ibm.com/systems/p/ha/

• PowerHA SystemMirror Wiki Page


http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/High+Availability

• PowerHA SystemMirror (“HACMP”) Redbooks


http://www.redbooks.ibm.com/cgi-bin/searchsite.cgi?query=hacmp

59
