The growth of the Internet has led to the development of critical network services where erroneous processing or outages are unacceptable. The availability and reliability of services such as online banking, stock trading, reservation processing, and online shopping have become increasingly important as their popularity grows. Downtime and failures lead to unsatisfied customers and translate directly into lost revenue for the service providers.
Fault-tolerance techniques use redundant components and/or redundant processing to ensure continued correct operation despite component failures. Most existing fault-tolerance solutions for network services have at least one of three limitations: they do not provide fault-tolerance for connections that are active at failure time, they expect servers to be deterministic, or they require changes to the clients. These limitations are unacceptable for many current and future network service applications. We propose a methodology for providing fault-tolerance without these limitations. Our solution, based on a standby backup approach, is transparent to the clients and requires minimal changes to the server OS and application.
We have used our methodology to add fault-tolerance features to two popular types of network services: web service and video conferencing. Off-the-shelf hardware and software components were used as the basis for both implementations. Modifications to the OS network stack using Linux kernel modules provide fault-tolerance at the connection level. At the application level, modifications to the web server and multi-conferencing unit, respectively, provide application-level synchronization and allow handling of non-deterministic server behavior. The associated issues, challenges, and tradeoffs of our methodology are presented in this work. The evaluation of our prototype implementations shows that client-transparent fault-tolerance can be achieved with relatively low overheads.
Index Terms
- Transparent fault-tolerant network services using off-the-shelf components