Designing A PowerHA SystemMirror For AIX High Availability Solution - HA17 - Herrera
Session Title:
Designing a PowerHA SystemMirror for AIX High Availability Solution
+
Workload-Optimizing Systems
Agenda
• Infrastructure Considerations
• Differences in 7.1
• Licensing Scenarios
• Summary
HACMP is now PowerHA SystemMirror for AIX!
HA & DR solutions from IBM for your mission-critical AIX applications
• Packaging Changes:
– Standard Edition - Local Availability
– Enterprise Edition - Local & Disaster Recovery
• Licensing Changes:
– Small, Medium, Large Server Class
Product Lifecycle:
Version                       Release Date    End of Support Date
HACMP 5.4.1                   Nov 6, 2007     Sept 2011
PowerHA 5.5.0                 Nov 14, 2008    N/A
PowerHA SystemMirror 6.1.0    Oct 20, 2009    N/A
PowerHA SystemMirror 7.1.0    Sept 10, 2010   N/A
* These dates are subject to change per Announcement Flash
PowerHA SystemMirror Minimum Requirements
HACMP 5.4.1 (latest fix level 5.4.1.8 – May 13):
• AIX 6.1 with RSCT version 2.5.0.0 or higher
• AIX 5.3 TL4 with RSCT version 2.4.5 (IY84920) or higher
• AIX 5.2 TL8 with RSCT version 2.3.9 (IY84921) or higher
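The AIX and RSCT levels above can be confirmed from the command line before installing the cluster filesets; a minimal check using standard AIX commands (an environment-specific administration fragment, not runnable off-host):

```shell
# Report the current AIX technology level and service pack
oslevel -s

# Confirm the installed RSCT filesets meet the minimum for this AIX release
lslpp -L "rsct.*"
```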
Common Misconceptions
Fact:
Clustering will highlight what you are & are NOT doing right in your environment
Common Mistakes Beyond Base Cluster Functionality
Solutions:
• IBM Training / Redbooks / Proof of Concepts / ATS Health-check Reviews
Identify & Eliminate Points of Failure
• LAN Infrastructure
– Redundant Switches
• SAN Infrastructure
– Redundant Fabric
• Application Availability
– Application Monitoring
– Availability Reports
Infrastructure Considerations
[Diagram: Node A and Node B connected through redundant LAN, SAN, and DWDM links to mirrored sets of 50GB LUNs]
Important:
Identify & Eliminate Single Points of Failure!
Infrastructure Considerations
[Diagram: the same Node A / Node B topology, with a single net_ether_0 network across redundant LAN, SAN, and DWDM links to mirrored 50GB LUNs]
Important:
Identify Single Points of Failure & design the solution around them
Infrastructure Considerations
• Power Redundancy
• I/O Drawers
• SCSI Backplane
• SAN HBAs
Real Customer Scenarios:
e.g. 1. Two nodes sharing an I/O drawer
[Diagram: I/O drawers 1–10 distributed across the two nodes]
PowerHA SystemMirror 7.1: Topology Management
Heartbeating differences from earlier cluster releases: multicasting
[Diagram: in 6.1 & below, VLAN heartbeat rings over network net_ether_0 — service IP 9.19.51.21, base addresses 192.168.101.1 and 192.168.101.2 on en1 — plus disk heartbeat networks diskhb_net3 and diskhb_net4 to LPAR 3]
Traditional heartbeating rules no longer apply. However, route striping is still a potential issue: when two interfaces have routable IPs on the same subnet, AIX will send half the traffic out of either interface.
PowerHA SM 7.1: Additional Heartbeating Differences
Heartbeating:
• Self-tuning Failure Detection Rate (FDR)
• All interfaces are used, even if they are not in cluster networks
[Diagram: two nodes on a VLAN — service IPs 9.19.51.20 and 9.19.51.21 on en3, base addresses 9.19.51.10 and 9.19.51.11 on en2]
Cluster Aware AIX (CAA)
What is it:
• A set of services/tools embedded in AIX to help manage a cluster of AIX
nodes and/or help run cluster software on AIX
• IBM cluster products (including RSCT, PowerHA, and the VIOS) will use
and/or call CAA services/tools
• CAA services can assist in the management and monitoring of an arbitrary
set of nodes and/or running a third-party cluster
Major Benefits:
• Enhanced Health Management (Integrated Health Monitoring)
• Cluster Wide Device Naming
Cluster Aware AIX Exploiters
[Diagram: RSCT consumers — IBM Director, DB2, TSA, HMC, HPC, VIOS, and PowerHA SystemMirror — layered on RSCT (Group Services, Resource Manager Services) on each node]
• RSCT and Cluster Aware AIX together provide the foundation of strategic Power Systems SW
• RSCT-CAA integration enables compatibility with a diverse set of dependent IBM products
• RSCT integration with CAA extends simplified cluster management along with optimized and robust
cluster monitoring, failure detection, and recovery to RSCT exploiters on Power / AIX
Cluster Aware AIX: Central Repository Disk
Contrast with previous releases
Multi Channel Health Management – Out of the Box
Hardened Environments with new communication protocol
[Diagram: redundant reliable heartbeat and messaging channels between cluster nodes]
Highlights:
• RSCT Topology Services is no longer used for cluster heartbeating
• All customers now have multiple communication paths by default
Basic Cluster vs. Advanced Cluster Features
Virtualization
[Diagram: Frame 1 and Frame 2, each with dual VIO Servers. Node A and Node B attach through virtualized NICs and HBAs — rootvg on hdisk2 / hdisk1, with the shared oracle_vg1 LUNs (hdisk4) presented from the SAN-attached storage subsystem to both nodes]
Power Virtualization & PowerHA SystemMirror
[Diagram: PowerHA cluster — PowerHA node 1 and node 2 as LPARs (LPAR X, Y, Z), each with AIX rootvg and data disks, en0, and virtual fibre adapters vfc0 / vfc1]
• LPAR / DLPAR
• Micropartitioning & Shared Processor Pools
PowerHA SystemMirror Virtualization Considerations
• Ethernet Virtualization
– Topology should look the same as environment using link aggregation
– Version 7.1 no longer uses netmon.cf file
– As a best practice dual VIO Servers are recommended
• SEA Fallover Backend
• Storage Virtualization
– Both methods of virtualizing storage are supported
• VSCSI vs. Virtual fiber (NPIV)
– In DR implementations leveraging disk replication consider the
implications of using either option
• Benefits of virtualization:
– Maximize utilization of resources
– Fewer PCI slots & physical adapters
– Foundation for advanced functions like Live Partition Mobility
– Migrations to newer Power Hardware are simplified
* Live Partition Mobility & PowerHA SM complement each other — see Chapter 2.4, "PowerVM Virtualization Considerations" (maintenance vs. high availability; non-reactive vs. reactive)
Virtual Ethernet & PowerHA SystemMirror
No Link Aggregation / Same Frame
[Diagram: SEA fallover across dual VIO Servers in Frame 1 — each VIOS bridges client traffic on PVID 10 through its SEA (ent4/en6), with a control channel on PVID 99 through the hypervisor, serving PowerHA LPAR 1 and PowerHA LPAR 2 (en0)]
This is a diagram of the configuration required for SEA fallover across VIO Servers. Note that Ethernet traffic is not load balanced across the VIO Servers; the lower trunk priority on the "ent2" virtual adapter designates which VIO Server is primary.
Virtual Ethernet & PowerHA SystemMirror
Independent Frames & Link Aggregation
[Diagram: two independent frames, each with dual VIO Servers using link aggregation — physical adapters (ent0/ent1) aggregated into ent2 and bridged to virtual adapters (ent5), serving PowerHA LPAR 1 on Frame 1 and PowerHA LPAR 2 on Frame 2]
PowerHA SystemMirror 6.1 & Below
net_ether_0
[Diagram: Topsvcs heartbeating between base addresses 9.19.51.10 and 9.19.51.11 on an AIX client LPAR served by dual VIO Servers]
Differences:
• RSCT Topology services is no longer used for heartbeat monitoring
• Subnet Requirements no longer need to be followed
• Netmon.cf file is no longer required or used
• All interfaces are used for monitoring even if they are not in an HA network
(this may be tunable in a future release)
Note: IGMP snooping must be enabled on the switches
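Because 7.1 heartbeating relies on multicast, multicast delivery between the nodes is worth verifying before creating the cluster. A sketch using the mping utility shipped with CAA (the multicast group address here is only an example — an environment-specific administration fragment):

```shell
# On the receiving node: listen on the chosen multicast group
mping -r -v -a 228.168.101.43

# On the sending node: transmit test packets to the same group
mping -s -v -a 228.168.101.43
```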
VSCSI Mapping vs. NPIV (virtual fiber)
[Diagram: Frame 1 and Frame 2, each hosting a cluster node behind dual VIO Servers attached to the storage subsystem. rootvg (hdisk0) and the vscsi_vg LUNs (hdisk1, hdisk2) are served through VSCSI mappings — vhost0 on the VIOS mapped to vscsi0/vscsi1 on the client, with MPIO across both VIO Servers. The npiv_vg LUNs (hdisk3, hdisk4) are served through NPIV virtual fibre (fcs1), again with MPIO across both VIO Servers]
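Whether a client LPAR sees its LUNs through VSCSI or NPIV can be checked from the client itself; a sketch using standard AIX device commands (the disk name is illustrative):

```shell
# VSCSI disks report as "Virtual SCSI Disk Drive"; NPIV disks show the
# real storage subsystem's device type, just as with physical HBAs
lsdev -Cc disk

# List the MPIO paths for a disk to confirm it is served by both VIO Servers
lspath -l hdisk4
```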
Live Partition Mobility Support with IBM PowerHA
How does it all work?
The two solutions complement each other by providing the ability to perform non-disruptive maintenance while retaining the ability to fail over in the event of a system or application outage
PowerHA and LPM Feature Comparison
Live
PowerHA Partition
SystemMirror Mobility
Live OS/App move between physical frames*
Energy Management**
Hardware Maintenance
Software Maintenance
30
PowerHA SystemMirror: DLPAR Value Proposition
Pros:
• Automated action on acquisition of resources (bound to the PowerHA application server)
• HMC verification checking for connectivity to the HMC
• Ability to grow LPAR on failover
Cons:
• Requires connectivity to HMC
• Potentially slower failover
System Specs:
– 32-way (2.3GHz) Squad-H+
– 256GB of memory
Results:
– 120GB DLPAR add took 1 min 55 sec
– 246GB DLPAR add took 4 min 25 sec
– At 30% busy running an artificial load, the add took 4 min 36 sec
[Diagram: LPAR A and LPAR B, each managed by redundant HMCs, run with a minimal CPU count; the DLPAR CPU count is added wherever the application server is active]
DLPAR Licensing Scenario
How does it all work?
[Diagram: System A and System B hosting four clusters —
Cluster 1: Oracle DB 1 CPU (+1 CPU acquired via DLPAR) / Standby 1 CPU
Cluster 2: Banner DB 1 CPU (+2 CPU with app) / Standby 1 CPU
Cluster 3: Standby 1 CPU / PeopleSoft 1 CPU (+1 CPU acquired via DLPAR)
Cluster 4: Standby 1 CPU / Financial DB 1 CPU (+2 CPU with app)
Table fragment (production sizing): PeopleSoft 2 CPU / 8 GB, Financial DB 3 CPU / 32 GB]
Environment: PowerHA App Server Definitions
[Diagram: the same System A / System B clusters, with application server definitions carrying CPU minimums — e.g. Min 1 / Desired 2 for one application server and Min 1 / Desired 3 for another — managed through the HMC]
The actual application requirements are stored in the PowerHA SystemMirror definitions and enforced during the acquisition or release of application server resources.
Environment: DLPAR Resource Processing Flow
[Diagram: on fallover the HMC performs DLPAR operations — Cluster 1's Oracle DB grows from 1 to 2 CPUs on the takeover system while 1 CPU is released on the original host; Cluster 2's Banner DB grows from 1 to 3 CPUs while 2 CPUs are released]
Takeaways:
• CPU allocations follow the application server wherever it is hosted (this model allows you to lower the HA license count)
• DLPAR resources only get processed during the acquisition or release of cluster resources
• PowerHA 6.1+ provides micro-partitioning support and the ability to also alter virtual processor counts
• DLPAR resources can come from free CPUs in the shared processor pool or from CoD resources
PowerHA SystemMirror: DLPAR Value Proposition
[Diagram: static model — Cluster 1: Oracle DB 2 CPU / Standby 2 CPU; Cluster 2: Banner DB 3 CPU / Standby 3 CPU; Cluster 3: Standby 2 CPU / PeopleSoft 2 CPU; Cluster 4: Standby 3 CPU / Financial DB 3 CPU]
PowerHA license counts (static model):
Cluster 1: 4 CPUs, Cluster 2: 6 CPUs, Cluster 3: 4 CPUs, Cluster 4: 6 CPUs
Total: 20 licenses
[Diagram: environment using the DLPAR model — each active LPAR licensed at 1 CPU, with +1 or +2 CPUs acquired via DLPAR along with the application]
PowerHA license counts (DLPAR model):
Cluster 1: 3 CPUs, Cluster 2: 4 CPUs, Cluster 3: 3 CPUs, Cluster 4: 4 CPUs
Total: 14 licenses
PowerHA SystemMirror: DLPAR Modified Model
[Diagram: the same DLPAR environment as the previous slide, but both production LPARs are consolidated into one LPAR per system, with control separated by resource groups]
PowerHA license counts (same as the previous slide):
Cluster 1: 3 CPUs, Cluster 2: 4 CPUs, Cluster 3: 3 CPUs, Cluster 4: 4 CPUs
Total: 14 licenses
Data Protection with PowerHA SM 7.1 & CAA
Enhanced Concurrent Mode Volume Groups are now required
Data Protection with PowerHA SM 7.1 & CAA
ECM Volume groups and the newly added protection
• CAA Fencing – prevents writes to the disk itself (e.g. dd, which runs below the LVM level)
[Diagram: Node A and Node B attached to shared LUNs]
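Enhanced Concurrent Mode volume groups can be created or converted with standard AIX LVM flags (the bos.clvm.enh fileset must be installed; the names here are illustrative):

```shell
# Create a new enhanced-concurrent-capable volume group on a shared LUN
mkvg -y datavg -C hdisk2

# Or convert an existing shared volume group to enhanced concurrent mode
chvg -C datavg
```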
Management Console: WebSMIT vs. IBM Director
CLI & SMIT sysmirror panels still the most common management interfaces
WebSMIT
• Available since HACMP 5.2
• Required a web server running on the host until HACMP 5.5 (gateway server model)
• Did not fall in line with the look and feel of other IBM offerings
WebSMIT Gateway Model: One-to-Many (6.1 & Below)
WebSMIT Screenshot: Associations Tab
PowerHA SystemMirror Cluster Management
New GUI for Version 7.1 clusters
[Diagram: each PowerHA node runs the Director agent (automatically installed on AIX 7.1 & AIX V6.1 TL06) and communicates securely with a central Director server — the central point of control, supported on AIX, Linux, and Windows, whose agent manager handles discovery of clusters and resources]
PowerHA SystemMirror Director Integration
IBM Systems Director: Monitoring Status of Clusters
• Accessing the SystemMirror Plug-ins
PowerHA SystemMirror Configuration Wizards
• Wizards
PowerHA SystemMirror Smart Assistant Enhancements
Deploy HA Policy for Popular Middleware
PowerHA SystemMirror Detailed Views
• SystemMirror Management View
IBM Director: Management Dashboard
Do you know about clvt & clmgr ?
• From this release forward, only “clmgr” is supported for customer use
– clvt is strictly for use by the Smart Assists
# lslpp -w /usr/sbin/clcmd
/usr/sbin/clcmd bos.cluster.rte
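A few representative clmgr invocations (the resource-group and node names are hypothetical; see the clmgr man page for the full verb/object matrix):

```shell
# Show the cluster configuration
clmgr query cluster

# Start cluster services on all nodes
clmgr online cluster

# Move a resource group to another node (hypothetical names)
clmgr move resource_group rg1 NODE=nodeB
```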
PowerHA SystemMirror: Sample Application Monitor
# cat /usr/local/hascripts/ora_monitor.sh
#!/bin/ksh
ps -ef | grep ora_pmon_hatest | grep -v grep
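The one-line check above can misfire because grep may match its own entry in the process table. A portable sketch of the same idea, wrapped as a function whose exit status matches what PowerHA expects from a custom monitor (0 = healthy, non-zero = failed):

```shell
# monitor_process PATTERN
# Returns 0 when a process matching PATTERN is running, non-zero otherwise.
# "grep -v grep" keeps the monitor from matching its own pipeline.
monitor_process() {
    ps -ef | grep "$1" | grep -v grep > /dev/null
}
```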
PowerHA SystemMirror: Pager Events
HACMPpager:
methodname = "Herrera_notify"
desc = "Lab Systems Pager Event"
nodename = "connor kaitlyn"
dialnum = "mherrera@us.ibm.com"
filename = "/usr/es/sbin/cluster/samples/pager/sample.txt"
eventname = "acquire_takeover_addr config_too_long event_error
node_down_complete node_up_complete"
retrycnt = 3
timeout = 45
# cat /usr/es/sbin/cluster/samples/pager/sample.txt
Node %n: Event %e occurred at %d, object = %o
Sample Email:
[Screenshot of the resulting notification email]
Attention:
Sendmail must be working and allowed through the firewall for notifications to be received
PowerHA SystemMirror Tunables
• Syncd Setting
– Default value of 60; recommended change to 10
• Failure Detection Rate (FDR) – only for Version 6.1 & below
– Normal settings should suffice in most environments (note that it can be tuned further)
– Remember to enable FFD (fast failure detection) when using disk heartbeating
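The syncd interval is set where the daemon is started in /sbin/rc.boot; a sketch of the recommended change (configuration excerpt, effective after reboot):

```shell
# /sbin/rc.boot (excerpt): lower the JFS flush interval from 60 to 10 seconds
nohup /usr/sbin/syncd 10 > /dev/null 2>&1 &
```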
PowerHA SystemMirror: Testing Best Practices
• Test Cluster
– LPARs within the same frame
– Virtual resources
Best Practice:
Testing should be the foundation for your documentation, in the event that someone who is not PowerHA-savvy is on hand when a failure occurs.
How to be successful with PowerHA SystemMirror
munited /# cltopinfo -m
Interface Name   Adapter Address   Total Missed Heartbeats   Current Missed Heartbeats
--------------------------------------------------------------------------------------
en0              192.168.1.103     0                         0
rhdisk1          255.255.10.0      1                         1
Summary
Learn More About PowerHA SystemMirror
Popular Topics:
* Customer References
* Documentation
* White Papers
http://www-03.ibm.com/systems/power/software/availability/aix/index.html
( … or Google ‘PowerHA SystemMirror’ and click I’m Feeling Lucky)
Questions?
• Online Documentation
http://www-03.ibm.com/systems/p/library/hacmp_docs.html