High Availability (HA) Choosing A Solution


High Availability (HA)

Choosing a Solution
Summary
What to look for when choosing a
high-availability solution?
This white paper is based on a case
study: a crisis in an airport.

Case study: a crisis in an airport

More comparison on evidian.com

3 partner testimonials

10 reasons to choose the SafeKit clustering software



Case study: a crisis in an airport
Every computer-based activity, big or small, is sooner or later faced with the problem of computer failure. Unfortunately, the day the failure occurs, a small problem may turn into a general crisis due to a succession of errors. This is exactly what happened at an airport, despite the fact that it was equipped with a redundant solution. This story will enable us to highlight the main characteristics of a good-quality high-availability solution.

The crisis occurred at an airport on the traveler-management application. This application displays information on the airport’s notice boards, making it possible for travelers to locate their boarding gate or to know where to retrieve their luggage.

High availability based on redundancy

The traveler-management database is not large. But this database is extremely sensitive and must withstand the worst-case scenario at an airport: an aircraft crash-landing on a computer room.

The selected redundancy solution consisted of two separate servers located in two remote computer rooms. Each server had a local traveler-management database, and a replication tool was used to ensure database redundancy between the main server and the backup server.

The incident

To keep the database current with one week of flights, every Sunday the database is updated with the flight schedules for the following week.

Unfortunately, one Sunday, while the flight update operation was in progress on the main server, replication stopped on the backup server. So, at the end of the update operation, the main server contained the flights for the following week, and the backup server the flights for the previous week.

On Monday, the main server supplied the right flight schedules to the airport’s notice boards. For the IT department, airport management was working properly, despite the undetected absence of replication on the backup server.

The crisis

In the middle of the week, the main server rebooted automatically as a result of an error. Unfortunately, it remained blocked in its booting process and failed to restart the traveler-management application. The notice boards at the airport went blank, and the IT department was alerted.

Since the main server was blocked in its booting process, a decision was taken to restart the application on the backup server.

The application was then restarted on the backup server without any difficulties and, a few minutes later, all the notice boards at the airport became active again. But they displayed to travelers the flight schedules for the previous week!

The red alert at the IT department soon turned into a black alert. The traveler-management application was immediately stopped on the backup server, and the IT team decided to solve the problem of the main server reboot. Meanwhile, the airport’s information-management application remained unavailable to all travelers.

The situation gets worse

A few hours later, the administrators succeeded in solving the problem on the main server and in rebooting it. But, unfortunately, in the booting phase, the replication tool on the main server started automatically. It detected that replication was already active on the backup server and resynchronized the main server’s local database from the backup server. The databases of both servers thus contained the flight schedules for the previous week!

Airport activities were disrupted for a whole day, and the main server could only be correctly resynchronized through the special Sunday operation.

In the end, the airport replaced the replication tool with a complete replication and high-availability solution, based on the criteria described hereinafter.

Where are the vulnerabilities of a high availability solution?

The problem encountered by the airport’s IT department was due to the replication solution chosen. Replication was working, but the solution was incomplete in terms of high availability and recovery in case of failure. Yet in its technical data sheet, this solution is said to have high-availability mechanisms with heartbeat and recovery scripts!

Therefore, it is necessary to properly identify the high-availability needs while choosing a replication solution. The needs expressed by the production team after this crisis at the airport are as follows.

Every administrator must know that replication has stopped on the backup server

The first event that led to the crisis at the airport was the updating of flight schedules on the main server on Sunday while the backup server was out of service.

On Monday, when the entire production team arrives, a simple glance at the replication application’s administration console should be enough to detect the absence of replication on the backup server.

Therefore, the replication solution must have an administration console which can remotely connect to servers and provide a summary of the status of a high-availability application on two servers. It must also be possible to send an email alert if replication stops, and to integrate the product into the administration console already used by the customer.

It should be very easy for every administrator to restart replication on the backup server

On Monday, upon detecting that replication has stopped on the backup server, any administrator from the IT department should be able to restart replication on the backup server:
• either by clicking a button on the replication administration console, or
• through a very simple online command offered by the replication solution on the backup server, or
• by rebooting the backup server.

Restart should never be reserved for a replication solution expert. The system must be made highly available again even when the expert is not present.
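The monitoring requirement described above (any administrator must see at a glance that replication has stopped, and an alert must go out) can be illustrated with a minimal sketch. It is not the interface of any particular product: the names `ReplicationState`, `ServerStatus`, `summarize` and `notify`, and the server names in the example, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationState(Enum):
    # Hypothetical states a console could display for each server.
    REPLICATING = "replicating"
    STOPPED = "stopped"
    UNREACHABLE = "unreachable"


@dataclass
class ServerStatus:
    name: str                      # hypothetical server name
    role: str                      # "main" or "backup"
    replication: ReplicationState


def summarize(statuses):
    """Return (healthy, alerts) for the pair of servers.

    healthy is True only when every server is replicating;
    alerts holds one line per server that is not.
    """
    alerts = [
        f"ALERT: replication is {s.replication.value} on {s.role} server {s.name}"
        for s in statuses
        if s.replication is not ReplicationState.REPLICATING
    ]
    return (not alerts, alerts)


def notify(alerts, send=print):
    """Deliver each alert through `send` (print here; mail in a real setup)."""
    for line in alerts:
        send(line)


if __name__ == "__main__":
    # The Monday-morning situation from the case study.
    pair = [
        ServerStatus("srv-main", "main", ReplicationState.REPLICATING),
        ServerStatus("srv-backup", "backup", ReplicationState.STOPPED),
    ]
    healthy, alerts = summarize(pair)
    notify(alerts)
```

Passing a mail-sending function as `send` would cover the e-mail requirement; the point of the sketch is only that the stopped state is surfaced rather than silently tolerated.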



It should be possible to resynchronize the backup server while the application is running on the main server

On Monday, when the IT department resumes work, the passenger-management application at the airport is running on the main server, and replication must be restarted as quickly as possible on the backup server. It should, therefore, be possible to resynchronize the backup-server database while the main server is in use.

Popular file replication solutions cannot resynchronize a backup server without stopping the application on the main server! These products implement replication solutions but absolutely not high-availability solutions.

If a server’s local databases are not updated, the solution must, by default, refuse to run the application on the non-updated server

In the case of the airport, the alert turned into a crisis because this feature did not exist.

The backup server, holding the flight schedule for the previous week, was not up to date, yet nothing stopped the application from starting on this server.

Later, when the main server was rebooted, data synchronization took place in the wrong direction: both servers were synchronized with the flights for the previous week, which led directly to the crisis at the airport.

Recovery-control mechanisms must enable the administrator to avoid human error, which a simple replication solution does not offer.

In case of failure, replication should not result in the loss of data about already registered travelers

This important requirement is not met with the asynchronous replication traditionally implemented by file replication solutions! Take care and check that the replication solution is synchronous.

In case of failure with an asynchronous replication solution, you have to locate the passengers who registered for a flight prior to the failure but whose details were not saved in the backup server database due to asynchronous replication. These passengers’ reservations are lost after the failure, and their seats are free again in the reservation system!

Data replication on the backup server must be synchronous to ensure high availability

With a synchronous replication solution, data is not lost in case of server failure. At the check-in counter, a passenger is checked in by reserving a seat for him or her on the aircraft. The reservation request is sent to the main server, and the passenger-management application permanently stores the reservation-related information on the local disk.

In synchronous replication mode, when a piece of information is permanently stored in a server's local database, it is also stored in the remote database.

Thus, when the check-in counter receives the acknowledgement of reservation and the attendant releases the passenger by issuing him or her a boarding pass, the reservation is permanently recorded in both servers’ databases (this is not the case with asynchronous replication).

If the main server fails, the passengers currently being checked in are put on hold because the attendants can no longer contact the reservation application. Once the application is restarted on the backup server, the attendants can resume checking in the passengers on standby. However, the reservations for the passengers checked in prior to the failure are not lost, since they had been saved in the backup-server database.

It must be possible to combine high availability with disaster recovery

High availability requires the presence of a synchronous replication system between two servers. Both servers must be placed on the same LAN for two reasons. The first reason is that the LAN’s bandwidth and latency (example: 1 Gb/s) determine the performance of synchronous replication. Secondly, having two servers on the same LAN ensures service IP address switchover when the application is restarted on the backup server.

Disaster recovery and high availability can be ensured simply by extending the same LAN with fiber optic cable between two geographically remote computer rooms. Thus, it is possible to simultaneously ensure application high availability and disaster recovery with two servers in two computer rooms and a real-time data replication solution across the network.

When data must be replicated via a low-speed WAN, data replication must be asynchronous, but an asynchronous replication system does not offer high availability: data is lost in case of failure. Asynchronous replication should be compared to a near-real-time system of remote backup on tape via the network, and not to a high-availability system.

Does the high-availability solution resist "split brain" without corrupting data?

Split brain occurs when two servers are isolated from each other on the network. Each server assumes that the other has failed, becomes primary and runs the application. During split brain, some high-availability solutions can corrupt the database, with two active applications on the same database. Moreover, you have to check that when the network is restored after a split brain, a sacrifice is properly executed by the high-availability solution, which stops the application on one of the two servers. Better still, double execution can be avoided by testing an external piece of network equipment that acts as a witness between the two servers.
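The split-brain safeguards described above (a witness test to avoid double execution, and a sacrifice when both servers have nevertheless run the application) can be sketched as follows. This is a simplified illustration, not the algorithm of any specific product. The function names and role labels are hypothetical, and the sketch assumes the witness is placed so that a truly isolated server cannot reach it.

```python
def decide_role(current_role: str, peer_reachable: bool, witness_reachable: bool) -> str:
    """Decide what a server should do after checking its peer and the witness.

    Roles are hypothetical labels: "primary", "secondary" or "stopped".
    """
    if peer_reachable:
        # Normal operation: no reason to change anything.
        return current_role
    if witness_reachable:
        # The peer is gone but this server is still on the network:
        # it may run (or keep running) the application.
        return "primary"
    # Neither peer nor witness answers: this server is the isolated one
    # and must not run the application.
    return "stopped"


def sacrifice(server_a: str, server_b: str, keep: str) -> str:
    """Return the server that must stop its application after a split brain.

    `keep` names the server whose data is retained; choosing it is a policy
    decision that this sketch leaves to the caller.
    """
    if keep not in (server_a, server_b):
        raise ValueError("keep must be one of the two servers")
    return server_b if keep == server_a else server_a


if __name__ == "__main__":
    # Backup server loses contact with the main server but still sees the witness.
    print(decide_role("secondary", peer_reachable=False, witness_reachable=True))
    # Main server is cut off from both the peer and the witness.
    print(decide_role("primary", peer_reachable=False, witness_reachable=False))
```

With this rule, at most one side of an isolation keeps the application running, provided the assumption about the witness placement holds.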



Thanks to this analysis using the example of the airport, it is now clear that a simple data replication
solution is not a high-availability solution. We have seen the pitfalls in case of server failure and we advise
the reader to check that a replication solution actually meets the high-availability criteria that we have just
explained.
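Two of the criteria from the airport case, refusing by default to start the application on a non-updated server and resynchronizing in the right direction, can be made concrete with a short sketch. It assumes each server keeps a counter of committed updates and a flag recording whether it was in sync when replication last stopped. Both fields and all names (`Node`, `data_version`, `up_to_date`) are hypothetical; real products track this state in their own ways.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    name: str
    data_version: int   # hypothetical counter, incremented on each committed update
    up_to_date: bool    # False once this node has missed replicated updates


def can_start_application(node: Node, force: bool = False) -> bool:
    """Refuse by default to start on stale data; an explicit override is required."""
    return node.up_to_date or force


def resync_direction(a: Node, b: Node) -> Optional[Tuple[Node, Node]]:
    """Return (source, target) so that data flows from the newer node to the older.

    Returns None when both nodes hold the same version.
    """
    if a.data_version == b.data_version:
        return None
    return (a, b) if a.data_version > b.data_version else (b, a)


if __name__ == "__main__":
    # After the Sunday update: main has next week's flights, backup missed them.
    main = Node("main", data_version=2, up_to_date=True)
    backup = Node("backup", data_version=1, up_to_date=False)

    print(can_start_application(backup))        # start on the backup is refused
    source, target = resync_direction(main, backup)
    print(source.name, "->", target.name)       # data must flow main -> backup
```

In the case study, neither check existed: the application started on the stale backup, and the later resynchronization copied the old data over the new.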

Many products claim to perform data replication and to offer high availability of applications,
whereas in reality they only implement data replication and are very incomplete in terms of recovery in
case of failure. With this type of product, an IT department mistakenly believes it has a redundant
high-availability solution. It then discovers the product’s limits the day a problem, such as a server blocking in
its booting process, turns into a generalized crisis, with a day of service unavailability for an entire airport!

SafeKit is the ideal solution for high availability of critical applications. It is a comprehensive and simple
product. It was chosen after the crisis at the airport and has advantageously replaced the competitor's
solution that had led to the critical situation described in this document.

More comparison on evidian.com


• Software clustering vs hardware clustering
• Shared nothing vs shared disk cluster
• Virtual machine HA vs application HA
• Byte-level file replication vs block-level disk replication
• Synchronous replication vs asynchronous replication
• Alternative to Microsoft NLB with SafeKit network load balancing in VMware

3 partner testimonials
1.  The ideal product for a software publisher
Testimonial of Philippe Vidal, Product Manager, Harmonic, Broadcasting:

“SafeKit is the ideal application clustering solution for a software publisher looking for a simple and economical high availability software.
We currently have more than 80 SafeKit clusters worldwide on Windows with our critical TV broadcasting application through terrestrial,
satellite, cable and IP-TV. SafeKit implements the continuous and real-time replication of our database as well as the automatic failover of our
application for software and hardware failures. Without modifying our application, it was possible for us to customize the installation of
SafeKit. Since then, the time of preparation and implementation has been significantly reduced.”

2.  A product that is very easy to deploy for a reseller


Testimonial of Tommy Park, CEO, WithNCompany, South Korea:

“WithNCompany has deployed in South Korea many SafeKit high availability solutions with the Samsung Video Surveillance Platform and
with the SSM application. SafeKit is appreciated because the product is easy to install and very quickly deployed. The SSM application does
not need to be modified to run in the high availability mode. The SSM application can be installed on the default C: drive and there is no
need to configure a separate disk volume. SafeKit is able to make real-time replication of SSM folders inside the C: drive. Moreover, for cost
saving, SafeKit can be installed on 2 Windows PCs instead of 2 servers.”

3.  A product that saves time for a system integrator


Testimonial of Philippe Marsol, Integration Manager, Atos, BU Transport:

“SafeKit is a simple and powerful product for application high availability. We have integrated SafeKit in our critical projects like the
supervision of Paris and Marseille metro lines (CCR, centralized control rooms). Thanks to the simplicity of the product, we gained time for
the integration and validation of the solution and we had also quick answers to our questions with a responsive Evidian team.”



10 reasons to choose the SafeKit clustering
software
1.  Software-only high availability solution
Evidian SafeKit is a software-only high availability solution. This solution quickly and easily secures the 24x7 operation of your critical
applications. While traditional high availability solutions are focused on the hardware failover of physical servers, SafeKit has chosen to
focus on the hardware and software failover of critical applications.

2.  High availability which targets all types of failures


The unavailability of an application can be due to 3 types of problems:
• Hardware and environment: including the complete failure of a computer room (20%).
• Software: regression on software update, overloaded service, software bug (40%).
• Human errors: administration error and inability to properly restart a critical service (40%).

SafeKit addresses these issues, which are all essential to ensure the high availability of critical applications.

3.  The 3 best use cases of software clustering


After over 15 years of 24x7 experience, SafeKit is the preferred clustering solution on the market in three cases:
• A software publishing company can add SafeKit to its application suite as a software OEM high availability option.
• A distributed enterprise can deploy a high availability solution on standard hardware without the need for specific IT skills.
• A data center can provide high availability for multiple applications with a uniform solution on Windows and Linux, with load balancing,
real-time data replication and failover between two remote sites.

4.  Unique on the market: 3 products in 1


Traditionally, three different products are necessary to create an application cluster:
• load balancing network boxes,
• disk bays replicated synchronously on a SAN for data availability,
• high-availability toolkits for application failure recovery.

SafeKit offers these three features within the same software product.
To further reduce implementation costs, SafeKit runs on your existing physical or virtual servers and with the standard editions of OS and
databases: Windows, Linux, Microsoft SQL Server, Oracle, Firebird, MySQL, PostgreSQL or other databases or flat files… and even Windows
editions for PCs!

5.  A solution suited for Cloud environments


Application high availability with SafeKit can be deployed either in virtual machines or on physical servers: the configuration does not
change. SafeKit also offers replication and failover of full virtual machines between two Hyper-V physical servers. The solution is simple and
economical because it requires no shared disk and works with the free edition of Hyper-V.

6.  Plug and play deployment of a software cluster


Once a failover module is configured and tested for an application, deployment requires no specific IT skills. Just install the application, the
SafeKit software and the failover module on two standard Windows or Linux servers.

7.  Rich choice of application integration inside a software cluster


SafeKit offers different types of software clusters. The cluster configuration for a given application is rich and is built from one or several
application modules. SafeKit offers mirror modules (primary/secondary with replication and failover), farm modules (network load
balancing and failover), and combinations of several modules.
A module is configured with the server IP addresses for heartbeats, the virtual IP address of the cluster, the load balancing rules for a farm
module, the file directories to replicate for a mirror module, the hardware and software failure detectors and the service to restart in case of
failure.



8.  User-friendly administration to avoid human error
SafeKit provides a centralized web administration console. An administrator can remotely monitor the status of applications on different
clusters and act with simple buttons (start, stop).
The documentation provides tests to validate the proper functioning of the application in high availability mode and includes
troubleshooting procedures.
Using the SafeKit generic command line interface, monitoring of critical applications protected by SafeKit fits easily into specific
administration consoles (Patrol, Microsoft SCOM, Nagios…).

9.  Synchronous replication for transactional applications


SafeKit’s synchronous real-time replication function strengthens high availability and prevents data loss. With this mechanism, data
committed to disk by a transactional application is replicated to the secondary machine.
Application servers can be located in geographically remote computer rooms through an extended LAN to withstand the loss of a full
room.
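The difference between synchronous and asynchronous replication for a transactional application can be illustrated with a small in-memory model. It is a generic sketch, not SafeKit's implementation: the classes `Replica`, `SyncPrimary` and `AsyncPrimary` are hypothetical, and a Python list stands in for a database on disk.

```python
class Replica:
    """Stands in for the secondary machine's database."""

    def __init__(self):
        self.records = []

    def store(self, record):
        self.records.append(record)


class SyncPrimary:
    """Acknowledges a commit only after the replica has stored the record."""

    def __init__(self, replica):
        self.local = []
        self.replica = replica

    def commit(self, record):
        self.local.append(record)
        self.replica.store(record)   # remote write happens before the acknowledgement
        return "ack"


class AsyncPrimary:
    """Acknowledges immediately and ships records to the replica later."""

    def __init__(self, replica):
        self.local = []
        self.pending = []
        self.replica = replica

    def commit(self, record):
        self.local.append(record)
        self.pending.append(record)  # queued; not yet on the replica
        return "ack"

    def flush(self):
        while self.pending:
            self.replica.store(self.pending.pop(0))


if __name__ == "__main__":
    sync_replica, async_replica = Replica(), Replica()
    SyncPrimary(sync_replica).commit("seat 12A")
    AsyncPrimary(async_replica).commit("seat 12A")
    # If the primary failed at this point, only the synchronous replica
    # would still hold the acknowledged reservation.
    print(sync_replica.records, async_replica.records)
```

In the asynchronous case, any record still in `pending` when the primary fails has been acknowledged to the user but does not exist on the secondary machine.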

10.  Implement your first software cluster in 1 hour


You can test SafeKit for free. In 1 hour, you can implement your first software cluster on two virtual or physical machines.



About Atos
Atos is a global leader in digital transformation
with approximately 100,000 employees in
73 countries and annual revenue of around
€ 12 billion. European number one in Big
Data, Cybersecurity, High Performance
Computing and Digital Workplace, the Group
provides Cloud services, Infrastructure &
Data Management, Business & Platform
solutions, as well as transactional services
through Worldline, the European leader
in the payment industry. With its cutting-
edge technologies, digital expertise and
industry knowledge, Atos supports the
digital transformation of its clients across
various business sectors: Defense, Financial
Services, Health, Manufacturing, Media,
Energy & Utilities, Public sector, Retail,
Telecommunications and Transportation.
The Group is the Worldwide Information
Technology Partner for the Olympic &
Paralympic Games and operates under the
brands Atos, Atos Consulting, Atos Worldgrid,
Bull, Canopy, Unify and Worldline. Atos SE
(Societas Europaea) is listed on the CAC40
Paris stock index.

Find out more about us


atos.net
CT_180619_EM_HIGH-AVAILABILITY-CHOOSING-SOLUTION-EVIDIAN-EN

For more information: Evidian.com

All trademarks are the property of their respective owners. Atos, the Atos logo, Atos Codex, Atos Consulting, Atos Worldgrid, Bull, Canopy,
equensWorldline, Unify, Worldline and Zero Email are registered trademarks of the Atos group. Atos reserves the right to modify this document
at any time without notice. Some offerings or parts of offerings described in this document may not be available locally. Please contact your local
Atos office for information regarding the offerings available in your country. This document does not represent a contractual commitment.
June 2018. © 2018 Atos
