Session from the Worldwide Software Architecture Summit on designing a platform-agnostic high availability (HA) system
Designing A Platform Agnostic HA System
1. Designing a platform agnostic
High Availability (HA) system
Runcy Oommen
Mar 01 | Worldwide Software Architecture Summit '22
2. The HA Agenda
• Definition and necessity for HA
• Types of HA system
• Channel establishment (also PUB-SUB style)
• Crossover - Heartbeat implementation
• Critical service monitoring
• Probing and networking overview
• Pre-emption concept in an HA system
• Floating IP to determine "active" node
• Design gotchas & pitfalls
3. Career
• Principal SDE at SonicWall with 18+ years of industry experience, primarily
in systems, cloud (private & public), security and networking
• 10x multi-cloud certified
• Special interest in serverless, containers and cloud-native
offerings. Firm believer in multi-hybrid cloud
Community
• Organizer of GDG Cloud and Cloud Native meetup
groups in Bangalore
• Regular speaker at domestic & international cloud, tech
& security conferences
• Multiple hackathon wins in cloud/security topics
• Recognized by Google as a community influencer
4. What does "High Availability" mean?
A system characteristic that aims to ensure an
agreed level of operational performance,
usually uptime, for a higher-than-normal period
Principles of systems design to help achieve HA:
1. Eliminating single points of failure by building redundancy
2. Reliable crossover, so the system continues to deliver its functionality
3. Detection of failures as they occur
Reference:
https://en.wikipedia.org/wiki/High_availability
5. Necessity for
HA design
• A critical point of any infrastructure/platform
• With the increased adoption of remote work,
it's imperative to provide enhanced availability
• It's architecture independent – monolith
or micro-services
• It's a trendy topic in modern software
design/development
6. Types of HA system
• Active-Standby implementation
• Active-Active implementation
@runcyoommen
8. Importance of
the HA channel
• Acts as THE communication link
between the nodes
• Performs an important heartbeat
monitoring mechanism
• Main control plane for data, status
and command exchange
• Link to be established via a reliable
stack (e.g. TCP/IP)
10. PUBlisher <> SUBscriber model
• Fast and easy to implement, if custom logic is
not a hard requirement
• A publisher and a subscriber mechanism must be
present at each node
• Essentially the same thing as the control channel –
events are handled by appropriate topics
• Many established language-specific libraries
and message queues exist
• RabbitMQ, Centrifugo, Google Pub/Sub etc.
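To make the topic-based exchange above concrete, here is a minimal in-process sketch. The `Broker` class and the topic name `ha/status` are hypothetical and only stand in for a real broker such as the ones listed; they are not the API of any of those products.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class Broker:
    """In-process stand-in for a real message broker (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler that runs for every message on `topic`."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver `message` to each handler on `topic`; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


# Each node would run both sides: it publishes its own status
# and subscribes to the status published by its peer.
broker = Broker()
peer_status: List[dict] = []
broker.subscribe("ha/status", peer_status.append)
broker.publish("ha/status", {"node": "node1", "state": "active"})
```

In a real deployment the broker sits on the network between the nodes, so its own availability becomes part of the HA design.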
11. Crossover - Heartbeat mechanism
• Determines the overall state of the
HA cluster
• Handles crossover logic in cases
like network loss, power outage, system
crash, host corruption etc.
• Provides threshold checks and
failure buffers to guard against false alarms
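The threshold check and failure buffer can be sketched as a small monitor that declares the peer failed only after several consecutive heartbeats are missed. The class name, the one-second interval and the threshold of three are assumptions for illustration, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks heartbeats received from the peer node over the HA channel."""

    interval_s: float = 1.0   # assumed gap between expected heartbeats
    miss_threshold: int = 3   # assumed failure buffer: missed beats tolerated
    last_seen: float = 0.0    # timestamp of the most recent heartbeat

    def record(self, now: float) -> None:
        """Call whenever a heartbeat arrives from the peer."""
        self.last_seen = now

    def missed(self, now: float) -> int:
        """Number of whole heartbeat intervals elapsed since the last beat."""
        return max(0, int((now - self.last_seen) / self.interval_s))

    def peer_failed(self, now: float) -> bool:
        """True only once the failure buffer is exhausted."""
        return self.missed(now) >= self.miss_threshold


monitor = HeartbeatMonitor(interval_s=1.0, miss_threshold=3)
monitor.record(now=10.0)
print(monitor.peer_failed(now=12.5))  # two intervals missed: still inside the buffer
print(monitor.peer_failed(now=13.0))  # three intervals missed: crossover is warranted
```

Passing the timestamp in explicitly keeps the logic testable without waiting on a real clock.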
12. Critical service
monitoring
• Maintain a roster of important processes
• Assign an appropriate weight to each service
• Perform failover depending on the overall service status
• Use a threshold mechanism for self-recovery
• Implement a watchdog daemon for retries
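One possible shape for the weighted roster and the retry watchdog is sketched below. The service names, weights, threshold and retry count are all hypothetical; a real system would tune them to its own critical processes.

```python
from typing import Callable, Dict

# Hypothetical roster: service name -> weight (higher means more critical).
ROSTER: Dict[str, int] = {"dataplane": 50, "config-sync": 30, "web-ui": 20}
FAILOVER_THRESHOLD = 60  # assumed: fail over when the score drops below this


def health_score(status: Dict[str, bool], roster: Dict[str, int] = ROSTER) -> int:
    """Sum the weights of the services that are currently running."""
    return sum(weight for name, weight in roster.items() if status.get(name, False))


def should_failover(status: Dict[str, bool], threshold: int = FAILOVER_THRESHOLD) -> bool:
    """Fail over only when the overall weighted status is below the threshold."""
    return health_score(status) < threshold


def recover(restart: Callable[[], bool], max_retries: int = 3) -> bool:
    """Watchdog-style self-recovery: retry a restart, stopping at the first success."""
    return any(restart() for _ in range(max_retries))


status = {"dataplane": True, "config-sync": True, "web-ui": False}
print(health_score(status), should_failover(status))
```

With these assumed weights, losing only the low-weight service keeps the node active, while losing the most critical one pushes the score below the threshold.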
14. Probing and networking overview in HA
• Continuously probe
a hostname or IP address
• Probe at a specified interval,
lasting for a certain duration
• Check against failure buffers
to avert false positives
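A rough sketch of interval-based probing with a failure buffer follows. The function names, the TCP-connect style of probe and the default values are assumptions for illustration; the talk does not prescribe a probe type.

```python
import socket
import time
from typing import Callable, Iterable, List


def tcp_probe(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """One probe: can a TCP connection to host:port be opened in time?"""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def probe_window(probe: Callable[[], bool], interval_s: float, duration_s: float,
                 sleep: Callable[[float], None] = time.sleep) -> List[bool]:
    """Run `probe` once per interval for the whole duration and collect the results."""
    results = []
    for _ in range(int(duration_s // interval_s)):
        results.append(probe())
        sleep(interval_s)
    return results


def target_down(results: Iterable[bool], failure_buffer: int = 3) -> bool:
    """True once `failure_buffer` consecutive probes have failed."""
    streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1
        if streak >= failure_buffer:
            return True
    return False


# A single failed probe between successes does not count as an outage.
print(target_down([True, False, True, False, False]))
```

Requiring consecutive failures is only one way to build the buffer; a ratio of failures over the window would be an equally valid choice.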
15. Probing functionality
• External IP/hostname – determines
internet access
• Internal host – for implementations on
a closed network
• 3rd party software dependency for
library or SDK
17. Floating IP - Concept
• A virtual addressing mechanism used in an HA cluster that moves
between devices, also known as a Virtual IP
• Determines the Active node in the event of a link or device failure
• Usually configured in the same subnet as the physical interface
18. Floating IP - Implementation
• Logic determined by the highest node priority – there could be
other criteria as well
Node 1 – Priority: 200 (Active)
Node 2 – Priority: 100 (Standby)
• The virtual address is bound to the MAC address of the physical
interface on the node with the highest priority
• Example packages/software: Keepalived, HAProxy etc.
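The priority rule on this slide can be expressed as a small election function. This is a sketch of the decision only, with a hypothetical `Node` type and a `healthy` flag added as an assumption; it does not move an address or reflect how Keepalived works internally.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Node:
    name: str
    priority: int
    healthy: bool = True  # assumed health flag fed by heartbeat/probe results


def elect_active(nodes: Iterable[Node]) -> Optional[Node]:
    """Pick the healthy node with the highest priority, or None if none is healthy."""
    candidates = [n for n in nodes if n.healthy]
    return max(candidates, key=lambda n: n.priority, default=None)


cluster = [Node("node1", 200), Node("node2", 100)]
print(elect_active(cluster).name)  # node1 holds the floating IP

degraded = [Node("node1", 200, healthy=False), Node("node2", 100)]
print(elect_active(degraded).name)  # node2 takes over the floating IP
```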
19. HA concept - Pre-emption
• Essentially means that the Primary
takes back the 'Active' role on
recovery
• Always restores the initial status quo
• Caution: might exhibit aggressive
behavior, leading to an undesirable
overall HA status
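The pre-emption decision can be sketched as below. The hold-down timer is an assumption added here as one common way to temper the aggressive behavior mentioned in the caution; the function and parameter names are hypothetical, and the sketch assumes the secondary is healthy whenever the primary is not.

```python
def next_active(active: str, primary: str, secondary: str,
                primary_healthy: bool, preempt: bool,
                primary_stable_s: float = 0.0, hold_down_s: float = 0.0) -> str:
    """Decide which node should hold the 'Active' role.

    active           -- node currently holding the Active role
    primary_stable_s -- how long the primary has been continuously healthy
    hold_down_s      -- assumed minimum stable time before the primary may pre-empt
    """
    if not primary_healthy:
        return secondary          # plain failover
    if preempt and primary_stable_s >= hold_down_s:
        return primary            # pre-emption: primary reclaims the role
    return active                 # no pre-emption: keep whoever is Active now


# Primary has recovered but has only been stable for 5 seconds.
print(next_active("node2", "node1", "node2", True, True,
                  primary_stable_s=5.0, hold_down_s=30.0))
```

Without a hold-down, a primary that keeps crashing and recovering would reclaim the role on every recovery and force a crossover each time.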
24. HA design
gotchas &
pitfalls to avoid
• HA config storage
o Never store data about the config in a DB
o Status should be node dependent
• Split-brain condition
o Each node believes it is the Active one
o Leads to channel failure and config
sync issues
• Huge design difference between Active-Standby &
Active-Active deployments
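As an illustration of the split-brain condition, the sketch below detects the case where both nodes report the Active role and applies a deterministic tie-break. The tie-break rule (higher priority wins, node id as fallback) is an assumption for the example, not a rule stated in the talk.

```python
def is_split_brain(local_role: str, peer_role: str) -> bool:
    """Split-brain: both nodes believe they are the Active one."""
    return local_role == "active" and peer_role == "active"


def resolve_role(local_priority: int, local_id: str,
                 peer_priority: int, peer_id: str) -> str:
    """Assumed tie-break: higher priority stays active; equal priorities use node id."""
    if (local_priority, local_id) > (peer_priority, peer_id):
        return "active"
    return "standby"


if is_split_brain("active", "active"):
    # Evaluated on the lower-priority node, which should step down.
    print(resolve_role(100, "node2", 200, "node1"))
```

Because both nodes evaluate the same rule on the same inputs, exactly one of them keeps the Active role as long as node ids are unique.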
25. High Availability – Key takeaways
• The right architecture – Huge difference between
Active-Standby & Active-Active implementation
• Robust channel – Paramount for the control and
overall state of the cluster
• Failover & Heartbeat – Ensure deep tests to finalize
the code and design. Never stop iterating/fine-tuning
• Reliable 3rd party package – Helps offload your
priority determination to get the 'Active' node