FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS

Prepared by:
Dave Maurvi Y.
ME CSE

Agenda
• Introduction
• Grid Computing
• Faults & Failure in Grid
• Fault Tolerance Techniques
• Future Enhancement

Introduction
• Grid computing is the collection of computer resources from
multiple locations to reach a common goal.
• One of the main strategies of grid computing is to use middleware
to divide and apportion pieces of a program among several
computers, sometimes up to many thousands.

GRID COMPUTING
• Grids are a form of distributed computing whereby
a “super virtual computer” is composed of many
networked loosely coupled computers acting together to
perform large tasks.
• Grid size varies considerably.
• Grid computing forms a virtual organization with
geographically distributed hardware and software
infrastructure.

• This infrastructure provides flexible, secure and
coordinated sharing of vast amounts of
heterogeneous resources from multiple
administrative domains.
• The nodes in a grid are easily combined to
produce a computing resource similar to a
multiprocessor supercomputer, but at a
lower cost.

• Due to network unavailability, development difficulty
or faulty resources, faults may occur in the results or
performance may be degraded.

FAULTS AND FAILURES IN GRID
• Important terms
• Fault: A fault is a violation of a system’s
underlying assumptions.
• Error: An error is an internal data state that
reflects a fault.
• Failure: A failure is an externally visible
deviation from specifications.

FAULTS AND FAILURES IN GRID
• Fault tolerance is an important property: the
system detects errors and recovers from them
without the participation of external agents,
such as humans.
• The more resources and components involved,
the more complicated and error-prone the
system becomes.

FAULTS
• Physical faults: faulty storage, faulty CPUs, faulty memory.
• Unconditional termination: typically the user pressing Ctrl+C.
• Network faults: packet corruption, packet loss, faults due to
network partition.
• Lifecycle faults: legacy or versioning faults.
• Processor faults: machine or operating system crashes.
• Media faults: disk head crashes.
• Service expiry faults: the service time of a resource may
expire while an application is using that resource in the grid.
• Process faults: software bugs, resource shortage.
• Interaction faults: timing overhead, protocol
incompatibilities, security incompatibilities, policy problems.

Three types of behaviors are possible in systems
after a failure:
• Failstop system: The system does not output any data once it
has failed. It immediately stops sending any events or
messages and does not respond to any messages.
• Failfast system: The system behaves like a Byzantine system
for some time but moves into a failstop mode after a short
period of time. It does not matter what type of fault or failure
has caused this behavior but it is necessary that the system
does not perform any operation once it has failed.
• Byzantine system: The system does not stop after a failure but
instead behaves in an inconsistent way. It may send out wrong
results of the application.
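
The three behaviors can be contrasted with a small simulation. This is only an illustrative sketch: the class and parameter names (FailedNode, grace_requests) are hypothetical, and the "short period of time" of a failfast system is modeled here as a fixed number of requests, which is an assumption.

```python
from enum import Enum


class FailureMode(Enum):
    FAILSTOP = "failstop"
    FAILFAST = "failfast"
    BYZANTINE = "byzantine"


class FailedNode:
    """Models how a node that has already failed answers later requests."""

    def __init__(self, mode, grace_requests=2):
        self.mode = mode
        # Hypothetical knob: how many erratic replies a failfast node
        # sends before it goes silent.
        self.grace_requests = grace_requests
        self.requests_seen = 0

    def respond(self, correct_value):
        """Return the node's reply, or None if it stays silent."""
        self.requests_seen += 1
        if self.mode is FailureMode.FAILSTOP:
            return None  # no output at all after the failure
        if self.mode is FailureMode.BYZANTINE:
            return correct_value + 1  # keeps answering, but wrongly
        # FAILFAST: erratic for a short while, then silent like failstop.
        if self.requests_seen <= self.grace_requests:
            return correct_value + 1
        return None
```

A caller can detect a failstop node with a timeout, whereas a Byzantine node can only be caught by cross-checking its answers, for example against replicas.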

FAULT TOLERANCE TECHNIQUES
Job and Data Replication
• In a grid environment, jobs/tasks or data are replicated to
tolerate faults.
• Algorithms used for this: the Adaptive Job Replication (AJR)
algorithm and the Backup Resources Selection (BRS) algorithm.
• Fault Tolerance using Adaptive Replication in Grid
Computing (FTARG) is an adaptive replication
middleware which addresses the fault tolerance of
grid-based applications by providing data replication at
different sites.

• FTARG enables data synchronization between multiple
heterogeneous databases located in the grid by
supporting a variety of synchronization modes.
• The resubmission technique combines task replication
and task resubmission, using a resubmission impact
metric that measures the impact of repeated task
resubmission on the execution time of a workflow.
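
A minimal sketch of combining replication with resubmission is given below. It is not AJR, BRS or FTARG, and it does not compute a resubmission impact metric. The function name and the max_resubmissions parameter are hypothetical, and the replicas are tried one after another rather than in parallel to keep the example short.

```python
def run_with_replication(task, resources, max_resubmissions=2):
    """Try `task` on each replica resource; resubmit the round if all fail.

    Returns (result, resubmissions_used). A fault is modeled as the task
    raising RuntimeError on a given resource.
    """
    for attempt in range(max_resubmissions + 1):
        for resource in resources:
            try:
                return task(resource), attempt
            except RuntimeError:
                continue  # this replica failed, fall through to the next one
    raise RuntimeError("task failed on every replica in every round")
```

Each extra round adds to the workflow's execution time, which is the cost a resubmission impact metric is meant to capture.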

Checkpointing
• A Fault Tolerance and Recovery component that extends
the Active BPEL workflow engine has been proposed to
develop mechanisms for building an autonomic
workflow management system that effectively detects,
diagnoses, notifies, reacts and recovers automatically
from failures during workflow execution.
• The default behavior of Active BPEL can be modified in
order to recover a process from a faulty state, using a
non-intrusive checkpointing mechanism.
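
The general idea of recovering from a faulty state through checkpoints can be sketched as follows. This is a plain application-level checkpoint/restart loop, not the Active BPEL mechanism. The function name, the JSON file format and the fail_at parameter (used only to simulate a fault) are assumptions made for illustration.

```python
import json
import os


def run_with_checkpoints(steps, checkpoint_path, fail_at=None):
    """Sum `steps`, saving progress after each one so a rerun can resume."""
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # Recovery: continue from the last saved state instead of step 0.
        with open(checkpoint_path) as f:
            state = json.load(f)
    for i in range(state["next_index"], len(steps)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated fault at step {i}")
        state["total"] += steps[i]
        state["next_index"] = i + 1
        # Simplification: a real system would write the file atomically.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state["total"]
```

Calling the function again with the same checkpoint_path after a fault repeats only the steps that had not yet been saved.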

• The Resource Fault Occurrence History (RFOH) strategy
maintains the fault occurrence history of resources in
the Grid Information Server (GIS).
• A resource broker with jobs to schedule uses this GIS
information in a genetic algorithm to look for a
near-optimal solution to the problem.
• Another proposal adds fault tolerance support to
parallel executions on grids by integrating the ComPiler
for Portable Checkpointing (CPPC), a checkpointing tool
for parallel applications, with GridWay, a meta-scheduler
provided with the Globus Toolkit.
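
How a broker might use a fault occurrence history can be sketched as below. This replaces the genetic algorithm search with a simple ranking by observed fault rate, purely for illustration. The data layout (resource name mapped to a pair of fault count and job count) and the function names are hypothetical, and treating a resource with no history as fault-free is an assumption.

```python
def fault_rate(history):
    """Fraction of past jobs on a resource that ended in a fault."""
    faults, jobs = history
    return faults / jobs if jobs else 0.0


def rank_resources(fault_history):
    """Order resource names from lowest to highest observed fault rate.

    Ties are broken by name so the ranking is deterministic.
    """
    return sorted(
        fault_history,
        key=lambda name: (fault_rate(fault_history[name]), name),
    )
```

For example, rank_resources({"a": (3, 10), "b": (1, 10)}) places "b" ahead of "a".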

Scheduling/Agent based migration
• Scheduling policies for grid systems can be classified into
space sharing and time sharing policies.
• In fault-tolerant scheduling, the primary-backup approach is a
commonly practiced methodology in which each task has a
primary copy and a backup copy submitted to two different
processors.
• Two algorithms, the Minimum Replication Cost with Early
Completion Time (MRC-ECT) algorithm and the Minimum
Completion Time with Less Replication Cost (MCT-LRC)
algorithm, have been proposed to schedule backups of
independent jobs and dependent jobs, respectively.
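
The primary-backup idea can be sketched with a simple greedy placement. This is neither MRC-ECT nor MCT-LRC: it only enforces that the two copies of a task land on different processors and picks the least-loaded candidates. The function name and the cost model (one number per task) are assumptions.

```python
def schedule_primary_backup(task_costs, processors):
    """Map each task to a (primary, backup) pair of distinct processors."""
    if len(processors) < 2:
        raise ValueError("primary-backup needs at least two processors")
    load = {p: 0 for p in processors}
    plan = {}
    for task, cost in task_costs.items():
        # Primary copy goes to the currently least-loaded processor.
        primary = min(processors, key=lambda p: load[p])
        load[primary] += cost
        # Backup copy goes to the least-loaded processor other than the primary.
        backup = min((p for p in processors if p != primary),
                     key=lambda p: load[p])
        # Pessimistic assumption: the backup reserves its full cost.
        load[backup] += cost
        plan[task] = (primary, backup)
    return plan
```

Because the copies never share a processor, a single processor fault cannot destroy both the primary and the backup of the same task.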

Load balancing
• A fault-tolerant policy, named the Fault Tolerant policy on
Dynamic Load Balancing (FTDLB), has been proposed to
balance loads dynamically in the P2P grid system.
• Load balancing strategies are classified as dynamic and
static. In general, the static load balancing strategy
needs prior information, such as the execution rate of
each node, to make load distribution decisions.

• On the other hand, the dynamic load balancing strategy
exploits the system information to make decisions at run
time.
• Load balancing strategies could also be categorized as
centralized or decentralized.
• The centralized strategy selects a single processor to
handle load scheduling, while the decentralized strategy
lets each participating node handle load balancing.
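
The difference between static and dynamic strategies can be sketched as follows. This is not FTDLB. Both functions are hypothetical illustrations: the static one uses only execution rates known in advance, while the dynamic one calls a probe_load function (an assumed callback returning the current load per node) before every placement.

```python
def static_balance(job_sizes, execution_rates):
    """Static: place jobs up front using only the known execution rates."""
    assigned = {node: 0.0 for node in execution_rates}
    placement = []
    for size in job_sizes:
        # Choose the node with the lowest estimated finish time.
        node = min(
            assigned,
            key=lambda n: (assigned[n] + size) / execution_rates[n],
        )
        assigned[node] += size
        placement.append(node)
    return placement


def dynamic_balance(jobs, probe_load):
    """Dynamic: consult the load observed at run time for each job."""
    placement = []
    for _job in jobs:
        load = probe_load()  # fresh system information for every decision
        placement.append(min(load, key=load.get))
    return placement
```

The static version never notices a node that slows down or fails after planning, whereas the dynamic version reacts as soon as the probe reports the change.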

Global behaviour modelling
• In theory, each machine is considered a single element, but
in practice, especially in management-related issues, it is
treated as a set of independent, loosely related elements.
• Elements such as CPUs, memory and its controllers, video
cards, hard drives and network interfaces have distinctive
functionalities and are technologically complex.
• Fault tolerance techniques in grid systems can be split
into two categories:
1. resource-level (focused on every machine)
2. service-level (focused on global behavior).

CONCLUSION AND FUTURE WORK
• Checkpoint fixation level: either at system level (i.e. at OS or
middleware level) or at application level.
• In-transit and orphan message management with checkpointing:
with a suitable policy, the latency incurred and the resources held up by
such messages could be reduced.
• Scope of checkpoint: local (for each process instance) or global (for
each parallel program in execution).
• Storage space requirement for checkpointing: light, where only the
first (top-level) assignment is stored, giving less storage and communication
overhead; or heavy, where, in addition to the light data, newly learnt clauses
are saved atop the decision stack.
• Granularity of checkpointing: full, where the entire state of the application
is saved; or incremental, where only the state changed since the previous
checkpoint is saved.
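
The full versus incremental distinction can be sketched with application state held in a dictionary. The function names are hypothetical, and the sketch assumes keys are only added or changed, never deleted, between checkpoints.

```python
def full_checkpoint(state):
    """Full granularity: copy the entire application state."""
    return dict(state)


def incremental_checkpoint(state, previous):
    """Incremental granularity: keep only entries that are new or changed."""
    return {
        key: value
        for key, value in state.items()
        if key not in previous or previous[key] != value
    }


def restore(full, increments):
    """Rebuild the latest state from one full checkpoint plus its deltas."""
    state = dict(full)
    for delta in increments:
        state.update(delta)  # later deltas override earlier values
    return state
```

Incremental checkpoints are smaller to store and transfer, but recovery has to replay every delta since the last full checkpoint.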

